Space-Time Light Field Rendering

Huamin Wang, Student Member, IEEE Computer Society,
Mingxuan Sun, Student Member, IEEE, and Ruigang Yang, Member, IEEE
Abstract—In this paper, we propose a novel framework called space-time light field rendering, which allows continuous exploration of
a dynamic scene in both space and time. Compared to existing light field capture/rendering systems, it offers the capability of using
unsynchronized video inputs and the added freedom of controlling the visualization in the temporal domain, such as smooth slow
motion and temporal integration. In order to synthesize novel views from any viewpoint at any time instant, we develop a two-stage
rendering algorithm. We first interpolate in the temporal domain to generate globally synchronized images using a robust spatial-temporal image registration algorithm followed by edge-preserving image morphing. We then interpolate these software-synchronized
images in the spatial domain to synthesize the final view. In addition, we introduce a very accurate and robust algorithm to estimate
subframe temporal offsets among input video sequences. Experimental results from unsynchronized videos with or without time
stamps show that our approach is capable of maintaining photorealistic quality from a variety of real scenes.
Index Terms—Image-based rendering, space-time light field, epipolar constraint, image morphing.
1 INTRODUCTION
During the last few years, image-based modeling and
rendering (IBMR) has become a popular alternative to
synthesizing photorealistic images of real scenes, whose
complexity and subtleties are difficult to capture with
traditional geometry-based techniques. Among various
IBMR approaches, light field rendering (LFR) [1], [2] is a
representative one that uses many images captured from
different viewpoints of the scene of interest, that is, a light
field. Given a light field, the rendering process becomes a
simple operation of rearranging recorded pixel intensities.
The process can, in theory, guarantee rendering quality for
any type of scene without resorting to difficult and often
fragile depth reconstruction approaches.
With technological advancement, it is now possible to
build camera arrays to capture a dynamic light field for
moving scenes [3], [4], [5], [6]. However, most existing
camera systems rely on synchronized input: Each set of
images taken at the same instant is treated as a separate
light field. Although a viewer can continuously change
the viewpoint in the spatial domain within each static
light field, exploring in the temporal domain is still
discrete. The only exception we have found so far is the
high-performance light field array from Stanford [7] in
which customized cameras are triggered in a staggered
pattern to increase the temporal sampling rate. The
triggering pattern is carefully designed to accommodate
the expected object motion.
. H. Wang and M. Sun are with the Georgia Institute of Technology, Tech
Square Research Building, 85 5th Street NW, Atlanta, GA 30332-0760.
E-mail: {whmin, cynthia}@cc.gatech.edu.
. R. Yang is with the Department of Computer Science, University of
Kentucky, One Quality Street, Suite 800, Lexington, KY 40507.
E-mail: ryang@cs.uky.edu.
Manuscript received 7 June 2006; accepted 25 Oct. 2006; published online 17
Jan. 2007.
For information on obtaining reprints of this article, please send e-mail to:
tvcg@computer.org, and reference IEEECS Log Number TVCG-0084-0606.
Digital Object Identifier no. 10.1109/TVCG.2007.1019.
In this paper, we present the notion of space-time light
field rendering (ST-LFR) that allows continuous exploration of
a dynamic light field in both the spatial and temporal
domain. A space-time light field is defined as a collection of
video sequences captured by a set of tightly packed cameras
that may or may not be synchronized. The basic goal of ST-LFR is to synthesize novel images from an arbitrary
viewpoint at an arbitrary time instant t. Traditional LFR is
therefore a special case of ST-LFR at some fixed time
instant. One may think that, given a sufficiently high
capture rate (that is, the full National Television System Committee (NTSC) rate of 30 fps), applying traditional LFR to the set of
images captured closest to t could generate reasonable
results. Unfortunately, a back-of-the-envelope calculation
shows that this is usually not true. Typical body motions
(approximately 0.6 m/s) at sitting distance can cause an
offset of more than 10 pixels per frame in a 30 fps VGA
sequence. As shown in Fig. 1b, the misalignment caused by
the subject’s casual (hand) motion is quite evident in the
synthesized image even when the temporal offset is less
than a half frame time, that is, 1/60 sec.
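The "more than 10 pixels per frame" figure follows from simple pinhole projection. A minimal sketch of the back-of-the-envelope calculation is given below; the 1 m sitting distance and the 45-degree horizontal field of view are illustrative assumptions that are not stated in the text.

```python
import math

# Back-of-the-envelope estimate of per-frame pixel motion for a 30 fps VGA camera.
# Assumed (illustrative) values: subject at 1 m, 45-degree horizontal field of view.
speed_m_per_s = 0.6          # typical casual body/hand motion
fps = 30.0
distance_m = 1.0             # assumed sitting distance
image_width_px = 640         # VGA
hfov_deg = 45.0              # assumed horizontal field of view

# Focal length in pixels for a pinhole camera: f = (W/2) / tan(HFOV/2).
focal_px = (image_width_px / 2.0) / math.tan(math.radians(hfov_deg / 2.0))

# Lateral motion during one frame, projected onto the image plane.
motion_per_frame_m = speed_m_per_s / fps
offset_px = focal_px * motion_per_frame_m / distance_m
print(f"~{offset_px:.1f} pixels per frame")   # roughly 15 px, consistent with "more than 10 pixels"
```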
Although the idea of spatial-temporal view synthesis has
been explored in the computer vision literature with sparse
camera arrays (for example, [8] and [9]), we believe that it is
the first time such a goal has been articulated in the context
of LFR. Using many images from a densely packed camera
array ameliorates the difficulty in 3D reconstruction.
Nevertheless, we still have to deal with the fundamental
correspondence problem. Toward this end, we adopt the
following guidelines: 1) We only need to find matches for
salient features that are both critical for view synthesis and
easy to detect and 2) we cannot guarantee that automatically found correspondences are correct; therefore, we seek
ways to “hide” these mismatches during the final view
synthesis. Based on the above two guidelines, we are able to
synthesize photorealistic results without completely solving
the 3D reconstruction problem.
Our approach to ST-LFR can be decomposed into two stages. First, we interpolate the original video sequences in the temporal domain to synthesize images for a given time instant. With these virtually synchronized images, we use traditional LFR techniques to render novel views in the spatial domain. The major technical difficulty of ST-LFR is in the first step: synthesizing "in-between" frames within each sequence, toward which we made the following contributions:

. A robust spatial-temporal image registration algorithm using optical flow and the epipolar constraint. The number of mismatches can be greatly reduced after enforcing both intersequence and intrasequence matching consistency. Furthermore, the spatial correspondences can be used to construct a 3D geometric proxy that improves the depth of field in LFR.

. A robust temporal image interpolation algorithm that provides superior results in the presence of inevitable mismatches. Our algorithm, based on image morphing [10], incorporates a varying weight term using the image gradient to preserve edges.

. A novel procedure to estimate temporal offsets among input video sequences. It is very robust and accurate. Our test using video sequences with ground-truth time stamps shows that the error in temporal offset is usually less than 2 ms, or 1/16 of a frame time (captured at 30 fps).

Fig. 1 compares results from our ST-LFR approach with traditional LFR. Note that ST-LFR not only can synthesize realistic images for any given time instant, but also can visualize a smooth motion trajectory by integrating images in the temporal domain. The improvements are much more obvious in the video at http://vis.uky.edu/~gravity/Research/st-lfr/st-lfr.htm.

Fig. 1. (a) A camera array we used to capture video sequences. The cameras are running at 30 fps without intercamera synchronization (that is, genlock). (b) Pictures synthesized using traditional LFR. (c) Pictures synthesized using ST-LFR. The bottom pictures show the visualization of a motion trajectory. We overlay images for each sequence, which is similar to temporal integration. These integrated images are used as input for LFR. The trajectories (for example, from the highlights) in (c) are much smoother since in-between frames are accurately synthesized and used for the temporally integrated images.

2 RELATED WORK
2.1 Light-Field-Based Rendering (LFR)
The class of light-field-based rendering techniques is
formulated around the plenoptic function, which describes
all of the radiant energy that can be perceived by an
observer at any point in space and time [11]. Levoy and
Hanrahan [1] pointed out that the plenoptic function can be
simplified in the regions of free space (that is, space free of
occluders), since radiant energy does not change over free
space. The reduced function (the so-called light field) can be
interpreted as a function on the space of oriented light rays.
The light field can be directly recorded using multiple
cameras. Once acquired, LFR becomes a simple task of table
lookups to retrieve the recorded radiant intensities.
A major disadvantage of light-field-based techniques is
the huge amount of data required; therefore, their application was traditionally limited to static scenes [1], [2], [12],
[13]. The acquisition of dense dynamic light fields has become feasible only recently with technological advancement. Most systems use a dense array of synchronized cameras to acquire dynamic scenes at discrete temporal intervals [3],
[4], [5], [6]. Images at each time instant are collectively
treated as an independent light field. As a result, viewers
can explore continuously only in the spatial domain but not
in the temporal domain. To create the impression of
dynamic events, the capture frame rate has to be sufficiently
high, further increasing the demand for data storage and
processing. In addition, hardware camera synchronization
is a feature available only on high-end cameras, which
significantly increases the system cost. Without synchronization, misalignment problems will occur in fast-motion
scenarios, as reported in one of the few light field systems
built with low-cost commodity Web cameras [14].
Just recently, Wilburn et al. [7] presented a light field
camera array using custom-made hardware. Many interesting effects are demonstrated, such as seeing through occluders and superresolution imagery. Particularly relevant to our work is their use of staggered triggering to increase
the temporal resolution. All cameras are synchronized
under a global clock, but each is set to a potentially
different shutter delay. By combining images taken at
slightly different times, the synthesized images can produce
smooth temporal motion as if taken by a high-speed
camera. In order to reduce ghosting effects, images are
warped based on optical flow, which we will discuss more
later. From a system standpoint, our ST-LFR framework is more flexible since it does not require a known temporal offset.

Fig. 2. The structure of our ST-LFR algorithm.
2.2 View Synthesis in Computer Vision
The task of view synthesis, that is, generating images from a
novel viewpoint given a sparse set of camera images, is an
active research topic in computer vision. Many of these
techniques focus on the dense geometric reconstruction of
the scene (for example, [15], [16], [17], and [18]). Although
some very impressive results have been obtained, the
problem of depth reconstruction remains open. In addition,
synchronized camera input is typically required. With
unsynchronized input, there are techniques to find the
temporal offset [19], [20], [21], [22], [23], but most of these
techniques do not provide subframe accuracy, that is, the
alignment resolution is one frame at best, which, as the simple calculation in Section 1 shows, is not adequate for view synthesis. Although the technique proposed in [23] does achieve subframe accuracy, it requires a regular capture rate and a minimum of four frames as input. Our method only requires three frames at any rate
and is extremely robust in the presence of feature matching
outliers. Furthermore, most of these alignment techniques
do not address the problem of synthesizing “in-between”
frames—a critical part of ST-LFR.
Our work is also related to superresolution techniques in
computer vision. Although most approaches focus on
increasing the spatial resolution (see [24] for a comprehensive review), a notable exception in [8] extends the notion of
superresolution to the space-time domain. Their approach
enables new visual capabilities of dynamic events such as
the removal of motion blur and temporal antialiasing.
However, they assumed that all images have already been
registered in the space and time volume, and the scene is
approximately planar. Our ST-LFR formulation does not
require a priori image registration and does allow arbitrary
scene and motion types. On the other hand, we do not
attempt to increase the spatial resolution.
One key component of ST-LFR is to find correspondences among images, which is achieved through optical
flow computation over both space and time. Optical flow is
a well-studied problem in computer vision. Interested
readers are referred to [25] and [26] for reviews and
comparison of different optical flow techniques. Although
flow algorithms typically use only a single sequence of
frames (two frames in most cases), there are a few
extensions to the spatial-temporal domain. Vedula et al.
introduced the concept of scene flow [27] to perform spatial-temporal voxel-based space carving [28] from a sparse
image set. With the voxel model, they perform ray casting
across space and time to generate novel views [9].
Under the context of LFR, Wilburn et al. [7] modified
the original optical flow by iteratively warping the nearest
four captured images across space and time toward the
virtual view. Their goal of synthesizing ghosting-free
views from images captured at different times was similar
to ours.1 It is also acknowledged in [7] that the results
suffered from the usual artifacts of optical flow. We go one step further. Instead of just warping with the optical flow,
we develop a novel edge-guided temporal interpolation as
a remedy for optical flow errors, thus increasing the
robustness of view synthesis.
3 ALGORITHM OVERVIEW
Given a set of video sequences captured by a calibrated
camera array, our goal is to construct a space-time light field
so that dynamic scenes can be rendered at arbitrary time
from arbitrary viewing directions. We do not require those
video cameras to be synchronized, and in fact, they do not
even need to share the same capturing rate. We have
developed a set of techniques to establish spatial-temporal
correspondences across all images so that view synthesis
can proceed. More specifically, our approach to ST-LFR has
the following major steps, as shown in Fig. 2:

1. image registration,
2. temporal offset estimation (optional),
3. synchronized image generation (temporal interpolation), and
4. LFR (spatial interpolation).

In the first step (Section 4), we use a robust spatial-temporal optical flow algorithm to establish feature correspondences among successive frames for each camera's video sequence. In cases when video sequences are not synchronized or relative temporal offsets are unknown, we have developed a simple but robust technique to estimate the offsets (presented in Section 7), which in turn reduces mismatches. Once feature correspondences are established, new globally synchronized images are synthesized using a novel edge-guided image-morphing method (Section 5). Synchronized video sequences are eventually used as inputs to synthesize images in the spatial domain using traditional LFR techniques (Section 6).

1. Note that [7] is in fact a concurrent development to ST-LFR. A preliminary version of this paper was published in March 2005 [29].
4 IMAGE REGISTRATION
Image registration in this context means finding pairwise
feature correspondences among input videos. It is the
most critical part of the entire ST-LFR pipeline. We first
discuss how to find feature correspondences given two
frames using optical flow, then show how the epipolar
constraint can be used to detect feature correspondence
errors, once the temporal offset between two frames is
learned (Section 7).
Throughout the paper, we use $C_i$ to denote cameras, where the subscript is the camera index, and $I_i(t)$ to denote an image taken from camera $C_i$ at time $t$. Note that $t$ does
not have to be based on a real clock. We typically use a
frame time (that is, 1/30 sec) as the unit time.
4.1 Bidirectional Flow Computation
Given two images $I_i(t_1)$ and $I_j(t_2)$, we calculate optical flows in both directions: forward from $I_i(t_1)$ to $I_j(t_2)$ and backward from $I_j(t_2)$ to $I_i(t_1)$. Note that the two frames can be captured by either the same camera at different times, resulting in the temporal flow, or by two different cameras, resulting in the spatial flow. The bidirectional flow computation provides more feature correspondences compared to the one-way flow. To acquire the forward flow, we first select corners with large eigenvalues on $I_i(t_1)$ as feature points using a Harris corner detector [30]. We enforce a minimum distance between any two feature points to prevent them from gathering in a small high-gradient region. We then calculate the sparse optical flow from $I_i(t_1)$ to $I_j(t_2)$ using tracking functions in the Open Source Computer Vision Library (OpenCV) [31], which are based on the classic Kanade-Lucas-Tomasi (KLT) feature tracker [32], [33]. Similarly, to get the backward flow from $I_j(t_2)$ to $I_i(t_1)$, we first find features on image $I_j(t_2)$, then calculate the optical flow from $I_j(t_2)$ to $I_i(t_1)$. Redundant feature correspondences from both flow computations are removed once they are found to be too close to each other.
We choose not to calculate a dense per-pixel flow since
there are certain areas such as occlusion boundaries and
textureless regions where flow computation is known to be
problematic. Our sparse flow formulation is not only more
robust, but also more computationally efficient. Moreover,
since our temporal interpolation algorithm is based on
salient edge features on boundaries, which are mostly
covered by our sparse flows, occluded and textureless
regions can still be correctly synthesized even if there are no
feature correspondences calculated within.
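The bidirectional sparse flow described above maps directly onto standard OpenCV primitives (Harris corners plus pyramidal KLT tracking). The sketch below is a minimal illustration of that procedure, not the authors' implementation; the corner count, quality level, and minimum feature distance are assumed values.

```python
import cv2
import numpy as np

def sparse_flow(src_gray, dst_gray, max_corners=500, min_dist=10):
    """Harris-style corner selection on src, then KLT tracking into dst.
    Returns (points_in_src, points_in_dst) for successfully tracked features."""
    pts = cv2.goodFeaturesToTrack(src_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=min_dist,
                                  useHarrisDetector=True)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    tracked, status, _err = cv2.calcOpticalFlowPyrLK(src_gray, dst_gray, pts, None)
    ok = status.ravel() == 1
    return pts.reshape(-1, 2)[ok], tracked.reshape(-1, 2)[ok]

def bidirectional_flow(img_a, img_b):
    """Forward flow (features picked on img_a) plus backward flow (features picked
    on img_b), as in Section 4.1. Duplicate correspondences that land too close to
    each other would be pruned in a later step."""
    a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
    fwd_a, fwd_b = sparse_flow(a, b)          # forward: I_i(t1) -> I_j(t2)
    bwd_b, bwd_a = sparse_flow(b, a)          # backward: I_j(t2) -> I_i(t1)
    src = np.vstack([fwd_a, bwd_a])
    dst = np.vstack([fwd_b, bwd_b])
    return src, dst
```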
4.2 Outlier Detection Using the Epipolar Constraint
Given two frames $I_i(t_1)$ and $I_j(t_2)$ $(i \neq j)$, we can use the epipolar constraint to detect and later remove wrong feature correspondences. The epipolar constraint dictates that, for any 3D point $P$ in the world space, $P$'s homogeneous projection locations $p_i$ and $p_j$ on the two cameras should satisfy the equation

$$p_i\, F\, p_j = 0, \qquad (1)$$

where $F$ is called the fundamental matrix, which can be determined from the camera projection matrices. If we consider this equation solely in 2D image space, then $p_i F$ can also be interpreted as a 2D epipolar line in camera $C_j$, and (1) means that $p_j$ must be on the epipolar line $p_i F$, and vice versa.

Fig. 3. The epipolar constraint is applied to detect bad matches in the spatial-temporal correspondences, assuming an unsynchronized input.

When $I_i(t_1)$ and $I_j(t_2)$ are captured at the same time (that is, $t_1 = t_2$), we assume that feature correspondences $p_i$ and $p_j$ belong to the projections of the same 3D point, so that we can directly use (1) to judge whether the correspondence is correct. Due to various error sources such as inaccurate camera calibration and camera noise, we set an error threshold $\varepsilon$. If $|p_i F p_j| > \varepsilon$, we consider this feature correspondence to be wrong and remove it from our correspondence list.

When $I_i(t_1)$ and $I_j(t_2)$ are captured at different but known time instants, we can use linear interpolation to estimate the feature location over time, assuming the motion is temporally linear. As shown in Fig. 3, given the temporal correspondence $p_i(t_1)$ and $p_i(t_1 + \Delta t)$, $p_i$'s location at $t_2$ $(t_1 < t_2 < t_1 + \Delta t)$ is

$$p_i(t_2) = \frac{t_1 + \Delta t - t_2}{\Delta t}\, p_i(t_1) + \frac{t_2 - t_1}{\Delta t}\, p_i(t_1 + \Delta t). \qquad (2)$$

Using the interpolated $p_i(t_2)$, the spatial correspondence between $p_i(t_1)$ and $p_j(t_2)$ can be verified as before:

$$|p_i(t_2)\, F\, p_j(t_2)| \le \varepsilon. \qquad (3)$$
If a pair of interpolated spatial correspondences does not satisfy the epipolar constraint, it means that either the temporal correspondence or the spatial correspondence, or both, might be wrong. Since we cannot tell which one may be correct, we discard both of them. In this way, the epipolar constraint is used to detect not only spatial correspondence errors, but also temporal correspondence errors.

This error detection method assumes that the motion pattern is roughly linear in the projective space. Many real-world movements, including rotation, do not strictly satisfy this requirement. Fortunately, when cameras have a sufficiently high frame rate with respect to the 3D motion, such a locally temporal linearization is generally acceptable [34]. Our experimental results also support this assumption. In Fig. 4, correct feature correspondences satisfy the epipolar constraint well even though the magazine was rotating fast. Fig. 4 also demonstrates the amount of pixel offset that casual motion can introduce: In two successive frames captured at 30 fps, many feature points move more than 20 pixels—a substantial amount that cannot be ignored in view synthesis.

Fig. 4. Feature points and epipolar line constraints. Red labels indicate wrong correspondences, whereas blue labels indicate right correspondences. (a) $p_i(t_1)$ on image $I_i(t_1)$. (b) $p_i(t_1 + \Delta t)$ on $I_i(t_1 + \Delta t)$. (c) $p_j(t_2)$ and $p_i(t_2)$'s epipolar lines on image $I_j(t_2)$.

4.3 Detecting Correspondence Errors in Multicamera Systems
As discussed above, we need three frames in order to detect bad matches. Given a temporal pair $I_i(t_1)$ and $I_i(t_1 + \Delta t)$, we need to select a reference camera $j$ that captures $I_j(t_2)$ with $t_1 \le t_2 \le t_1 + \Delta t$. Ideally, the more reference cameras we use, the more errors we can detect. However, the number of false negatives increases with the number of reference cameras as well; for example, correct temporal correspondences may be removed due to wrong spatial correspondences caused by reference cameras. Therefore, we need a selection scheme to choose good reference frames in order to remove correspondence errors without causing too many false negatives.

Reference cameras spatially close to the original camera (that is, camera $i$) in the camera array are good candidates since there will be fewer spatial correspondence errors caused by occlusions. The ambiguity along the epipolar line should also be taken into consideration: When a feature is moving horizontally, cameras on the same horizontal line should not be used since the feature stays almost on the horizontal epipolar line. In addition, for camera arrays built on regular grids, it is not necessary to use more cameras on the same column or row because they will share similar epipolar lines.

Given a selected reference camera, in order to fully reveal correspondence errors calculated by the forward flow, we prefer reference frames with capturing time $t_2$ close to $t_1 + \Delta t$, because the flow is calculated from known feature locations on $I_i(t_1)$ to $I_i(t_1 + \Delta t)$. Similarly, we prefer reference frames captured near $t_1$ when examining correspondences calculated by the backward flow.

In order to evaluate camera $i$'s temporal correspondences, our reference camera selection scheme calculates a quality value $w_j$ for each camera $j$ using the following equation:

$$w_j = D(j) + \lambda\, |t_2 - t|, \qquad (4)$$

where $D(j)$ measures the spatial closeness between camera $i$ and camera $j$, $t_2$ is the closest temporal frame to $t$ from camera $j$'s sequence, that is,

$$t_2 = \arg\min_{t_2 \in \text{Sequence } j} |t_2 - t|, \qquad (5)$$

and $\lambda$ is a parameter that balances the influence of spatial closeness and time difference. Since we detect feature correspondence errors calculated from the backward flow and the forward flow separately, correspondences constructed by different flows use different reference frames and cameras too. $t$ is either $t_1$ or $t_1 + \Delta t$, depending on which flow is being examined. If the camera array is constructed on regular grids, $D(j)$ can simply be expressed as the distance between array indices. In that case, we choose one reference camera along the column and one along the row to provide both horizontal and vertical epipolar constraints. When cameras share the same frame rate, reference cameras only have to be selected once.
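As a concrete illustration of the epipolar test in (1)-(3), the sketch below checks a set of putative spatial correspondences against the fundamental matrix, linearly interpolating the feature position in camera $i$ when the two frames were captured at different known times. It is a simplified, illustrative sketch rather than the authors' code; the deviation is measured as a point-to-line distance in pixels (the paper's (3) uses the raw algebraic residual), and the three-pixel threshold is one of the empirical values reported later in Section 8.

```python
import numpy as np

def point_line_distance(line, p):
    """Distance from homogeneous 2D point p = (x, y, 1) to line l = (a, b, c)."""
    a, b, c = line
    return abs(a * p[0] + b * p[1] + c) / np.hypot(a, b)

def epipolar_outliers(F, pi_t1, pi_t1dt, pj_t2, t1, t2, dt, eps=3.0):
    """Flag correspondences violating the epipolar constraint.
    pi_t1, pi_t1dt : Nx2 temporal correspondences in camera i at t1 and t1 + dt
    pj_t2          : Nx2 spatial correspondences in camera j at t2 (t1 <= t2 <= t1 + dt)
    F              : fundamental matrix such that p_i F p_j = 0 (p_i as a row vector)
    Returns a boolean mask of rejected correspondences."""
    alpha = (t2 - t1) / dt
    # Eq. (2): linearly interpolate camera i's feature position to time t2.
    pi_t2 = (1.0 - alpha) * pi_t1 + alpha * pi_t1dt
    bad = np.zeros(len(pi_t2), dtype=bool)
    for k in range(len(pi_t2)):
        pi_h = np.array([pi_t2[k, 0], pi_t2[k, 1], 1.0])
        pj_h = np.array([pj_t2[k, 0], pj_t2[k, 1], 1.0])
        line_in_j = pi_h @ F            # epipolar line of p_i in camera j
        # Eq. (3): measure the deviation, here as a point-to-line distance in pixels.
        bad[k] = point_line_distance(line_in_j, pj_h) > eps
    return bad
```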
5 TEMPORAL INTERPOLATION WITH EDGE GUIDANCE
Once we have calculated the temporal correspondence
between two successive frames from one camera, we
would like to generate intermediate frames at any time
for this camera so that synchronized video sequences can
be produced for traditional LFR. One possible approach is
to simply triangulate the whole image and blend two
images by texture mapping. However, the result quality
is quite unpredictable since it depends heavily on the
local triangulation of feature points. Even a single feature
mismatch can affect a large neighboring region on the
image.
Our robust temporal interpolation scheme, based on the image-morphing method [10], incorporates varying weights using the image gradient. In essence, line segments are formed automatically from point correspondences to cover the entire image. Segments aligned with image edges gain extra weights to preserve edge straightness—an important visual cue. Therefore, our method can synthesize smooth and visually appealing images even in the presence of missing or wrong feature correspondences.

5.1 Extracting Edge Correspondences
We first construct segments by extracting edge correspondences. Given a segment $e_i(t)$ connecting two feature points on image $I_i(t)$ and a segment $e_i(t + \Delta t)$ connecting the corresponding feature points on $I_i(t + \Delta t)$, we would like to know whether this segment pair $(e_i(t), e_i(t + \Delta t))$ are both aligned with image edges, that is, form an edge correspondence. Our decision is based on two main factors: edge strength and edge length.

Regarding edge strength, our algorithm examines whether a potential edge matches both the image gradient magnitude and orientation, similar to the Canny edge detection method [35]. On a given segment, we select a set of samples $s_1, s_2, \ldots, s_n$ uniformly and calculate the fitness function for this segment as

$$f(e) = \min_{s_k \in e}\left(|\nabla I(s_k)| + \alpha\, \frac{\nabla I(s_k)}{|\nabla I(s_k)|} \cdot N(e)\right). \qquad (6)$$

The first term in (6) gives the gradient magnitude, and the second term indicates how well the gradient orientation matches the edge normal direction $N(e)$. $\alpha$ is a parameter that balances the influence of the two terms. We calculate the fitness values for both $e_i(t)$ and $e_i(t + \Delta t)$. If both values are greater than a threshold, it means that both $e_i(t)$ and $e_i(t + \Delta t)$ are in strong gradient regions, and the gradient direction matches the edge normal. Therefore, $e_i(t)$ and $e_i(t + \Delta t)$ may represent a real surface or texture boundary in the scene. Fig. 5 shows a gradient magnitude map and detected feature edges.

Fig. 5. Real feature edges. (a) The gradient magnitude map. (b) Feature edges (blue) detected by testing feature points pairwise.

The second factor is the edge length. We assume that the edge length is nearly constant from frame to frame, provided that the object distortion is relatively slow compared with the frame rate. This assumption is used to discard edge correspondences if $|e_i(t + \Delta t)|$ changes too much from $|e_i(t)|$. Most changes in edge length are caused by wrong temporal correspondences. Nevertheless, the edge length change may be caused by reasonable point correspondences as well. For instance, in Fig. 6, the dashed blue arrows show the intersection boundaries between the magazine and the wall. They are correctly tracked; however, since they do not represent real 3D points, they do not follow the magazine motion at all. Edges that end at those points can be detected from length changes and removed to avoid distortions during temporal interpolation. Similarly, we do not consider segments connecting one static point (that is, a point that does not move in the two temporal frames) with one dynamic point, due to the segment length change.

Fig. 6. The optical flow (a) before and (b) after applying the epipolar constraint. Correspondence errors on the top of the magazine are detected and removed.

A detail worth mentioning here is that we include a maximum edge length constraint $L_{max}$ in our implementation. This constraint has nothing to do with edge correspondence correctness but rather serves to accelerate the edge correspondence detection. Given a point feature $p$ as an endpoint of a possible edge, the other endpoint must be contained in the circle centered at $p$ with radius $L_{max}$. We build a kD-tree structure for feature points to quickly discard any points outside of that circle. When $L_{max}$ is relatively small compared with the frame resolution, the computational cost can be greatly reduced from $O(n^2)$ to $O(mn)$, where $n$ is the total number of feature points, and $m$ is the average number of feature points clustered in circles with radius $L_{max}$.

5.2 Forming the Edge Set
Since the real edge correspondences generated from Section 5.1 may not always cover the whole image region, we add virtual edge correspondences to the edge set to avoid distortions caused by sparse edge sets. Virtual edges are created by triangulating the whole image region using constrained Delaunay triangulation [36]. We treat the existing edge correspondences as edge constraints and use the endpoints of each edge plus the four image corners as input vertices. Note that we have to break intersecting feature edges since no intersection is allowed among edge constraints for triangulation. Fig. 7 shows a sample triangulation result.

Fig. 7. The edge map constructed after constrained Delaunay triangulation.

Sometimes, more than two feature points may be selected on a thick, high-gradient band, so that more than one feature edge may be generated (as shown in Fig. 8) even though the band represents the same edge in the world space. This is problematic since those extra feature edges will repetitively apply more influence weight than they should during the next image-morphing step. We detect this by testing whether real feature edges are nearly parallel and adjacent. When detected, we keep only the longest one.
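The edge-strength test in (6) can be sketched as follows. It is an illustrative reimplementation, not the authors' code: the Sobel gradient, the sample count, and the value of the balance parameter alpha are assumptions, and the alignment term is taken as the absolute dot product between the normalized gradient and the edge normal N(e).

```python
import numpy as np
import cv2

def edge_fitness(gray, p0, p1, alpha=20.0, n_samples=16):
    """Fitness f(e) of the segment p0-p1 on a grayscale image, in the spirit of Eq. (6):
    the minimum, over samples along the segment, of the gradient magnitude plus an
    alignment term between the (normalized) gradient and the edge normal N(e)."""
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)

    d = np.asarray(p1, float) - np.asarray(p0, float)
    normal = np.array([-d[1], d[0]]) / (np.linalg.norm(d) + 1e-9)   # N(e)

    fitness = np.inf
    for s in np.linspace(0.0, 1.0, n_samples):
        x, y = np.asarray(p0, float) + s * d
        ix, iy = int(round(x)), int(round(y))
        g = np.array([gx[iy, ix], gy[iy, ix]])
        mag = np.linalg.norm(g)
        align = abs(np.dot(g / (mag + 1e-9), normal))               # orientation agreement
        fitness = min(fitness, mag + alpha * align)
    return fitness
```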
5.3 Morphing
We extend the image-morphing method [10] to interpolate two frames temporally. We allow real edges to have more influence weight than virtual edges since real edges are believed to have physical meaning in the real world. In [10], the edge weight was calculated using the edge length $L$ and the point-edge distance $D$:

$$\mathrm{weight}_0 = \left(\frac{L^{p}}{a + D}\right)^{b}, \qquad (7)$$

where $a$, $b$, and $p$ are constants that affect the line influence. We calculate the weight for both real and virtual feature edges using the formula above. The weight for a real edge is then further boosted as

$$\mathrm{weight} = \mathrm{weight}_0 \left(1 + \frac{f(e) - f_{min}(e)}{f_{max}(e) - f_{min}(e)}\right)^{\beta}, \qquad (8)$$

where $f(e)$ is the edge samples' average fitness value from (6), $f_{min}(e)$ and $f_{max}(e)$ are the minimum and maximum fitness values among all real feature edges, respectively, and $\beta$ is a parameter that scales the boosting effect exponentially.

We temporally interpolate two frames using both the forward temporal flow from $I_i(t)$ to $I_i(t + \Delta t)$ and the backward temporal flow from $I_i(t + \Delta t)$ to $I_i(t)$. The final pixel intensity at time $t_0$ is calculated by cross dissolving:

$$P_{t_0} = P_{forward}\, \frac{t + \Delta t - t_0}{\Delta t} + P_{backward}\, \frac{t_0 - t}{\Delta t}, \qquad (9)$$

where $P_{forward}$ is the pixel color calculated only from frame $I_i(t)$, and $P_{backward}$ is the pixel color calculated only from frame $I_i(t + \Delta t)$. This is because we have more confidence in $I_i(t)$'s features in the forward flow and in $I_i(t + \Delta t)$'s features in the backward flow. These features are selected directly on the images.

Fig. 8. Three feature points $p_0$, $p_1$, and $p_2$ are selected on the same dark boundary, forming three edges $e_0$, $e_1$, and $e_2$. We only keep the longest one, $e_2$, which represents the entire edge in the real world.

Fig. 9. Interpolation result (a) without and (b) with virtual feature edges.

Fig. 10. The interpolated result using different schemes. (a) Image morphing without the epipolar test. (b) Direct image triangulation with the epipolar test. (c) Image morphing with the epipolar test, but without extra edge weights. (d) Image morphing with both the epipolar test and extra edge weights.

Fig. 9 shows the results without and with virtual edges. Fig. 10 shows the results using different interpolation schemes. Since some features are missing on the top of the magazine, the interpolation quality improves when real feature edges get extra weights according to (8). Even without additional weights, image morphing generates more visually pleasing images in the presence of bad feature correspondences.
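To make the weighting and blending steps concrete, the sketch below combines (7), (8), and (9). It is a simplified illustration under stated assumptions: the Beier-Neely field warp itself is omitted, warp_forward and warp_backward stand in for frames already morphed from I_i(t) and I_i(t + Δt) to time t0, and the constants a, b, p, and beta are illustrative values rather than the ones used by the authors.

```python
import numpy as np

def edge_weight(length, dist, fitness, f_min, f_max,
                a=1.0, b=2.0, p=0.5, beta=3.0, is_real_edge=True):
    """Eq. (7) base weight, boosted by Eq. (8) for real (image-aligned) edges."""
    w0 = (length ** p / (a + dist)) ** b
    if not is_real_edge or f_max <= f_min:
        return w0
    boost = 1.0 + (fitness - f_min) / (f_max - f_min)
    return w0 * boost ** beta

def cross_dissolve(warp_forward, warp_backward, t, dt, t0):
    """Eq. (9): blend the two temporally warped frames according to t0's position
    inside the interval [t, t + dt]."""
    w_fwd = (t + dt - t0) / dt       # 1 at t0 = t, 0 at t0 = t + dt
    w_bwd = (t0 - t) / dt
    return w_fwd * warp_forward.astype(np.float32) + w_bwd * warp_backward.astype(np.float32)
```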
6 VIEW SYNTHESIS (SPATIAL INTERPOLATION)
Once we generate synchronized images for a given time instant, we can use traditional LFR techniques to synthesize views from novel viewpoints. To this end, we adopt the unstructured lumigraph rendering (ULR) technique [37], which can utilize graphics texture hardware to blend appropriate pixels together from the nearest cameras in order to compose the desired image.

Fig. 11. A reconstructed 3D mesh proxy. (a) Oblique view. (b) Front view.
ULR requires an approximation of the scene (a geometric
proxy) as an input. We implemented two options. The first is
a 3D planar proxy in which the plane’s depth can be
interactively controlled by the user. Its placement determines
which region of the scene is in focus. The second option is a
reconstructed proxy using the spatial correspondences. For
unsynchronized input, the correspondences are interpolated
to the same instant. The 3D positions can then be triangulated. In order to reconstruct a proxy that covers the entire
image, we also estimate a backplane based on the reconstructed 3D vertices. The plane is positioned and oriented so
that all the vertices are in the front. Finally, all the vertices
(including the four corners of the plane) are triangulated to
create a 3D mesh. Fig. 11 shows an example mesh.
Whether to use a plane or a mesh proxy in rendering
depends mainly on the camera configuration and scene
depth variation. As we will show later, in Section 8, the
planar proxy works well for scenes with small depth
variations. On the other hand, using a reconstructed mesh
proxy improves both the depth of field and the 3D effect for
oblique-angle viewing.
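A minimal sketch of the reconstructed-proxy option follows, assuming calibrated cameras with known 3x4 projection matrices and spatial correspondences already interpolated to a common time instant. The backplane estimation is simplified to an axis-aligned plane placed slightly behind the farthest triangulated vertex; the paper does not specify the exact fitting procedure, so that part is an assumption.

```python
import numpy as np
import cv2
from scipy.spatial import Delaunay

def build_mesh_proxy(P_i, P_j, pts_i, pts_j, margin=0.05):
    """Triangulate interpolated spatial correspondences into 3D, then append a
    backplane so the proxy covers the whole view.
    P_i, P_j     : 3x4 projection matrices of two calibrated cameras
    pts_i, pts_j : Nx2 corresponding (virtually synchronized) image points
    Returns (vertices, triangles) of a simple mesh proxy."""
    # Linear triangulation of the sparse correspondences.
    X_h = cv2.triangulatePoints(P_i, P_j, pts_i.T.astype(float), pts_j.T.astype(float))
    X = (X_h[:3] / X_h[3]).T                       # Nx3 Euclidean points

    # Simplified backplane: perpendicular to the assumed viewing axis (z),
    # slightly behind the farthest reconstructed vertex.
    z_back = X[:, 2].max() * (1.0 + margin)
    lo, hi = X[:, :2].min(axis=0), X[:, :2].max(axis=0)
    corners = np.array([[lo[0], lo[1], z_back],
                        [hi[0], lo[1], z_back],
                        [hi[0], hi[1], z_back],
                        [lo[0], hi[1], z_back]])
    vertices = np.vstack([X, corners])

    # 2D Delaunay triangulation of the projected vertices gives the mesh topology.
    tri = Delaunay(vertices[:, :2])
    return vertices, tri.simplices
```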
7 ESTIMATION OF TEMPORAL OFFSET
With unsynchronized input, the temporal offset between
any two sequences is required in many aspects of our
system. This offset can be obtained in several ways. Some
types of cameras such as the ones from Point Grey Research
[38] can provide a global time stamp for every frame. One
can also insert a high-resolution clock into the scene as a
synchronization object. If neither approach is applicable, we
present here a robust technique to estimate the temporal
offset in software. Again, our technique is based on the
epipolar constraint.
We estimate the offset pairwise using a temporal correspondence $(p_i(t), p_i(t + \Delta t))$ and a spatial correspondence $(p_i(t), p_j(t'))$ from three frames. We know that the temporal offset $\tilde{t} = t' - t$ satisfies

$$p_i(t')\, F_{ij}\, p_j(t') = 0, \qquad p_i(t') = \frac{\Delta t - \tilde{t}}{\Delta t}\, p_i(t) + \frac{\tilde{t}}{\Delta t}\, p_i(t + \Delta t), \qquad (10)$$

where $F_{ij}$ is the fundamental matrix between $C_i$ and $C_j$, and $p_i(t')$ is the estimated location at time $t'$, assuming temporally linear motion. If all feature points are known, the equation system above can be organized as a single linear equation of one unknown $\tilde{t}$. Geometrically, this equation finds the intersection between $p_j(t')$'s epipolar line and the straight line defined by $p_i(t)$ and $p_i(t + \Delta t)$. As shown in Fig. 12, if the intersection happens between $p_i(t)$ and $p_i(t + \Delta t)$, then $0 < \tilde{t} < \Delta t$; if the intersection happens before $p_i(t)$, then $\tilde{t} < 0$; and if the intersection happens beyond $p_i(t + \Delta t)$, then $\tilde{t} > \Delta t$.

Fig. 12. An illustration of three possible cases for the temporal offset $(\tilde{t})$ between two frames. (a) shows a feature point in camera $C_j$ at time $t'$; its epipolar line in $C_i$ can intersect the feature trajectory (from $t$ to $t + \Delta t$) in $C_i$ in three ways, as shown in (b).

Ideally, given frames $I_i(t)$ and $I_j(t')$, we can calculate the offset $\tilde{t}$ using just a single feature in (10). Unfortunately, the flow computation is not always correct in practice. To provide a more robust and accurate estimation of the temporal offset, we design our method as four steps:

1. select salient features on image $I_i(t)$, and calculate the temporal flow from $I_i(t)$ to $I_i(t + \Delta t)$ and the spatial flow from $I_i(t)$ to $I_j(t')$ using the technique described in Section 4.1,
2. calculate the time offsets $\tilde{t}_{[1]}, \tilde{t}_{[2]}, \ldots, \tilde{t}_{[N]}$ for each feature $P_{[1]}, P_{[2]}, \ldots, P_{[N]}$ using (10),
3. detect and remove time offset outliers using the Random Sample Consensus (RANSAC) algorithm [39], assuming that outliers are primarily caused by random optical flow errors, and
4. calculate the final time offset using a Weighted Least Squares (WLS) method, given the remaining inliers from RANSAC.

We enumerate all possible offsets (each feature correspondence gives one offset estimate) as offset candidates in RANSAC. For each candidate, we determine inliers as the consensus subset of estimated offsets that are within a tolerance threshold $d$ of the candidate offset. $d$ is typically $0.05\,\Delta t$ in our experiments, where $\Delta t$ is usually one frame time, that is, 1/30 sec. We then treat the largest consensus subset as the inliers among the initial offsets.
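Substituting the second relation of (10) into the first collapses the system to one linear equation in t̃ per feature. The sketch below illustrates that per-feature solve together with the consensus step; it is an illustrative reimplementation rather than the authors' code, using the convention from (1) that F_ij p_j gives p_j's epipolar line in camera i.

```python
import numpy as np

def offset_from_feature(F_ij, pi_t, pi_tdt, pj_tp, dt):
    """Solve Eq. (10) for one feature: intersect p_j(t')'s epipolar line with the
    straight-line trajectory p_i(t) -> p_i(t + dt). Points are 2D pixel coordinates."""
    a = np.array([pi_t[0], pi_t[1], 1.0])          # p_i(t)
    b = np.array([pi_tdt[0], pi_tdt[1], 1.0])      # p_i(t + dt)
    pj = np.array([pj_tp[0], pj_tp[1], 1.0])       # p_j(t')
    line_in_i = F_ij @ pj                          # epipolar line of p_j(t') in camera i
    la, lb = line_in_i @ a, line_in_i @ b
    if np.isclose(la, lb):                         # trajectory parallel to the epipolar line: degenerate
        return None
    return dt * la / (la - lb)                     # t~, may lie outside [0, dt]

def ransac_offset(offsets, tol):
    """Pick the offset candidate with the largest consensus set (step 3 above).
    Each per-feature offset serves as a candidate; tol is typically 0.05 * dt."""
    offsets = np.array([o for o in offsets if o is not None])
    best_inliers = np.array([], dtype=int)
    for cand in offsets:
        inliers = np.flatnonzero(np.abs(offsets - cand) <= tol)
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return offsets[best_inliers]                   # inlier offsets, to be refined by WLS
```

The degenerate case flagged in the sketch (trajectory parallel to the epipolar line) is the same failure mode discussed for the magazine sequence in Section 8.3.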
Even after we remove outliers using RANSAC, the offsets computed from inliers can still vary because of various errors. Therefore, we further refine the temporal offset estimate from the remaining inliers using a WLS approach. The cost function is defined as

$$C = \sum_{k=1}^{M} w_k\, (\tilde{t} - \tilde{t}_k)^2, \qquad (11)$$

where $M$ is the total number of remaining inliers, and the weight factor $w_k$ is

$$w_k = e^{-\gamma |\tilde{t} - \tilde{t}_k|}, \qquad \gamma \ge 0. \qquad (12)$$

The weight factor is sensitive when $\gamma$ is large; otherwise, the formulation approaches a standard least squares fit as $\gamma$ goes to zero. We can now reformulate (11) as a weighted average:

$$\frac{dC}{d\tilde{t}} = 2 \sum_{k=1}^{M} w_k\, (\tilde{t} - \tilde{t}_k) = 0, \qquad \tilde{t} = \sum_{k=1}^{M} w_k\, \tilde{t}_k \Big/ \sum_{k=1}^{M} w_k. \qquad (13)$$

We first calculate the mean value of all the inlier offsets as the initial offset. Then, during each regression step, we recalculate the weight factor for each inlier and then the weighted average offset $\tilde{t}$. Since RANSAC has already removed most outliers, the WLS fitting converges quickly.

Fig. 13. The space-time light field rendering pipeline.
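The fixed-point iteration implied by (12) and (13) is short enough to sketch directly. The convergence tolerance, the iteration cap, and the value of gamma below are assumptions; the paper only states that the iteration converges quickly once RANSAC has removed most outliers.

```python
import numpy as np

def wls_offset(inlier_offsets, gamma=50.0, max_iters=50, tol=1e-6):
    """Refine the temporal offset from RANSAC inliers by iteratively reweighted
    least squares: weights from Eq. (12), weighted average from Eq. (13)."""
    t = np.asarray(inlier_offsets, dtype=float)
    t_hat = t.mean()                               # initial offset: mean of the inliers
    for _ in range(max_iters):
        w = np.exp(-gamma * np.abs(t_hat - t))     # Eq. (12)
        t_new = np.sum(w * t) / np.sum(w)          # Eq. (13)
        if abs(t_new - t_hat) < tol:
            break
        t_hat = t_new
    return t_hat
```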
We have empirically found several criteria for determining how well suited a certain frame $I_i(t)$ is for the offset estimation. First, feature motions should be sufficiently large; the estimation becomes more difficult and inaccurate if the scene is almost static. Second, we prefer feature motions to be perpendicular to the epipolar line, since the offset $\tilde{t}$ is found by intersecting the feature trajectory with the epipolar line. Third, we may choose frames with more salient features and less occlusion to avoid correspondence errors. In general, we typically choose good short sequences, determine the rough alignment, and then calculate the estimated $\tilde{t}$ for each frame $I_i(t)$.
If we can further assume that a sequence is captured at a
regular interval without dropped frames, we can take the
average of all estimations to get the final temporal offset for
the entire sequence. The estimated temporal offset can
further be used to remove erroneous correspondences as
outlined in Section 4.2.
8 RESULTS
More results can be found in the accompanying video or
online at http://vis.uky.edu/~ryang/media/ST-LFR/.
We have implemented our ST-LFR framework and tested
it with real data. In Fig. 13, we show a graphical
representation of our entire processing pipeline. To facilitate data capturing, we built two multicamera systems. The
first system uses eight Point Grey Dragonfly cameras [38]:
three color ones and five gray-scale ones (see Fig. 1a). The
mixed use of different cameras is due to our resource limit.
This system can provide a global time stamp for each frame
in hardware. The second system includes eight Sony color
digital FireWire cameras, but the global time stamp is not
available. In both systems, the cameras are approximately
60 mm apart, limited by their form factor. Based on the
analysis in [40], the effective depth of field is about
400 mm. The cameras are calibrated, and all images are
rectified to remove lens distortions.
Since our cameras are arranged in a horizontal linear
array, we only select one camera for the epipolar constraint
according to the discussion in Section 4.3. The closeness measure is simply the distance between camera positions.
We first demonstrate our results from the data tagged
with a global time stamp, that is, the temporal offsets are
known. The final images are synthesized using either a
planar proxy or a reconstructed proxy from spatial
correspondences. Then, we quantitatively study the accuracy of our temporal offset estimation scheme. Finally, we
put everything together by showing synthesis results from a
data set without a global time stamp, followed by a
discussion of our approach.
8.1 View Synthesized with a Planar Proxy
To better demonstrate the advantage of ST-LFR over the
traditional LFR, we first show some results using a planar
proxy for spatial interpolation. The data sets used in this
section were captured by the Point Grey camera system
with known global time stamps. Our first data set includes
a person waving his hand, as Fig. 1 shows. Since the depth
variation from the hands to the head is slightly beyond the
focus range (400 mm), we can see some tiny horizontal
ghosting effects in the synthesized image, which are entirely
different from vertical mismatches caused by the hand’s
vertical motion. We will show later that these artifacts can
be reduced by using a reconstructed proxy.
We emphasize the importance of virtual synchronization
in Fig. 14, in which the image is synthesized with a constant
blending weight (1/8) for all the eight cameras. The scene
contains a moving magazine. Without virtual synchronization (Fig. 14a), the text on the magazine cover is illegible.
This problem is rectified after we registered feature points
and generated virtually synchronized images.
Our next data set is a moving open book. In Fig. 15, we
synthesized several views using traditional LFR and
ST-LFR. Results from ST-LFR remain sharp from different viewpoints. The noticeable change of intensity is due to the mixed use of color and gray-scale cameras.

Fig. 14. Synthesized results using a uniform blending weight (that is, pixels from all frames are averaged together). (a) Traditional LFR with unsynchronized frames. (b) ST-LFR.
8.2 View Synthesized with a Reconstructed Proxy
As mentioned in Section 6, building a 3D mesh geometry
proxy improves the synthesis results for scenes with large
depth range. This method is adopted in the following two
data sets, and the results are compared with the planar proxy.
The first data set is the open book, as shown in Fig. 16. It is
obvious that the back cover of the book in this example is
better synthesized with the 3D mesh proxy for its improved
depth information. The second example shown in Fig. 17 is
a human face viewed from two different angles. The face rendered with the planar proxy (a) shows some noticeable
distortions in side views. The comparative results are better
illustrated in the accompanying video.
Fig. 15. Synthesized results of the book example. (a) Traditional LFR with unsynchronized frames. (b) ST-LFR.

Fig. 16. Top row: synthesized results of the book sequence. Bottom row: the reconstructed proxy. (a) Plane-proxy adopted. (b) Three-dimensional reconstructed mesh proxy adopted.

Fig. 17. Synthesized results of a human face from two different angles. (a) Plane proxy adopted. (b) 3D mesh proxy adopted.

8.3 Experiments on Temporal-Offset Estimation
All the above results are obtained from data sets in which the global time stamp for each frame is known. However, since most commodity cameras do not provide such capability, the temporal offsets of sequences captured by those cameras need to be estimated first. We now evaluate the accuracy of our estimation technique discussed in Section 7.
We first tested with a simulated data set in which feature
points are moving randomly. We add random correspondences (outliers) and perturb the ground-truth feature
correspondences (inliers) by adding a small amount of
Gaussian noise. Fig. 18 plots the accuracy of our approach
with various outlier ratios. It shows that our technique is
extremely robust even when 90 percent of the offsets are outliers.

Fig. 18. Temporal offset estimation error using synthetic tracked features. The ratio of outliers varies from 10 percent to 90 percent. Using three variations of our techniques, the relative errors to the ground truth are plotted. The error is no greater than 15 percent in any test case.
Our second data set is a pair of video sequences with the
ground-truth offset. This is obtained from our Point Grey
camera system. Each sequence, captured at 30 fps, contains
11 frames. The temporal offset between the two is 10 ms,
that is, 0.3 frame time. Fig. 19 shows the estimated temporal
offset; the error is typically within 0.05 frame time. The
average epipolar distance for inliers is less than 1 pixel.
From the above experiments with known ground truth,
we can see that our approach can produce very accurate
and robust estimation of the temporal offset. In addition,
combining RANSAC and WLS fitting yields better results
than either of these techniques alone.
The second formula in (10) is a linear approximation of $p_i(t')$ using the feature correspondence $p_i(t)$ and $p_i(t + \Delta t)$. Although object motions in most scenes are nonlinear, our experimental results indicate that the linear approximation is valid for sequences captured at video rate, that is, 30 fps or higher. We did, however, observe that the best estimates were obtained when $\tilde{t}$ satisfied $0 \le \tilde{t} \le \Delta t$. Therefore, given frame $I_i(t)$, we need to choose a frame $I_j(t')$ that is captured between $I_i(t)$ and $I_i(t + \Delta t)$. Given a pair of roughly aligned sequences, we search several frames ahead and behind to find an appropriate frame pair.
We also compare our method with the approach by Zhou and Tao [23]. Their method, which can also achieve subframe accuracy, uses four corresponding points ($p_i(t)$, $p_i(t + \Delta t)$, $p_j(t')$, and $p_j(t' + \Delta t)$) to compute a cross ratio that yields the offset $\tilde{t}$. Instead of using all four correspondence points, we use only three, excluding the point $p_j(t' + \Delta t)$. Experiments show that our method usually performs slightly better. The advantage of using only three frames is obvious when there are dropped frames.
Fig. 19. Estimated time offset using real-image sequences. The
magenta line shows the exact time offset (0.3 frame time or 10 ms).
Each of the test sequences is 11 frames long.
Fig. 20. Estimated time offset using synthetic image sequences. The
blue line shows the exact time offset (0.5 frame time). Each of the test
sequences is 31 frames long.
Our first data set is synthetic sequences with a dynamic
scene generated by OpenGL. The dynamic scene consists of
a static background and a moving rectangle. Both surfaces
are well textured. Images were synthesized at two viewpoints mimicking a standard stereo setup. We render the
scene from two viewpoints with a temporal offset of half a
frame, that is, $\tilde{t} = 0.5$. The estimation is shown in Fig. 20.
Our second data set is the same as the first one except
that it contains lost frames. As shown in Fig. 21, the
sequence from one of the viewpoints lost the sixth frame.
Thus, the ground-truth time offset jumps from half a frame
to one and a half frames. Our method gains an advantage
over that in [23] in the fifth and seventh frames.
Our last data set is a real pair of rotating magazine
sequences with a known offset (shown in Fig. 6). Results are
shown in Fig. 22. In this sequence, our estimates are
generally better. This example also shows an interesting case in which both methods fail: In that frame (frame 22), the movement of the magazine is almost parallel to the epipolar line—a degenerate case for both methods.

Fig. 21. Estimated time offset using synthetic image sequences with the sixth and seventh frames lost. The blue line shows the exact time offset (0.5 frame time or 10 ms). Each of the test sequences is nine frames long.

Fig. 22. Estimated time offset using real-image sequences. The blue line shows the exact time offset (0.3 frame time or 10 ms). Each of the test sequences is 31 frames long.
8.4 Data with Unknown Temporal Offset
Our last data set is captured with the Sony color camera system without a global time stamp. It contains a swinging toy
ball. We started all cameras simultaneously to capture a set
of roughly aligned sequences. Then, we use the technique
introduced in Section 7 to estimate a temporal offset for
each sequence pair, using the sequence from the middle
camera as the reference. Since we do not know the ground
truth, we can only verify the offset accuracy by looking at
the epipolar distances for temporally interpolated feature
points. Fig. 23a shows the temporal optical flow seen on a frame $I_i(t)$, and Fig. 23b shows $p_i(t')$'s epipolar lines, created using our estimated time offset, on the frame $I_j(t')$. Note that the spatial correspondences $p_j(t')$ lie right on the epipolar lines, which verifies the correctness of our estimated temporal offset. Using a planar proxy, the synthesized results are shown in Fig. 24. A visualization of the motion trajectory of the ball can be found in the last row of Fig. 1.

Fig. 23. Verifying the estimated time offset using epipolar lines. (a) Temporal optical flow on camera $C_i$. (b) The epipolar line of $p_i(t')$ (which is linearly interpolated based on the temporal offset) and $p_j(t')$ on camera $C_j$.
8.5 Discussion
In comparison with the traditional LFR, our formulation
requires correspondence information. Finding correspondences is an open problem in computer vision, and it can be
fragile in practice. Although we have designed several
robust techniques to handle almost inevitable mismatches
and experimental results have been quite encouraging, we
have not studied our techniques’ stability over long
sequences yet. On the other hand, the biggest problem in
matching—textureless regions [41]—is circumvented in our
formulation since we only compute a sparse flow field that
is sufficient to synthesize novel views.
Table 1 lists some parameters discussed earlier and their empirical value ranges used in our experiments. Since the epipolar constraint is independent of the data set, the epipolar constraint threshold $\varepsilon$ always falls into the three-pixel to five-pixel range. However, the reference camera selection parameter $\lambda$ is sensitive to both the camera array settings and the captured scenes. The effects and choices of $a$, $b$, and $p$ in the morphing equation (7) are well explained in [10]. The remaining weighting parameters, though having different units, are all used to boost exponential weights. We typically adjust those weighting parameters so that the largest weight is 1,000 times greater than the smallest weight value. This works well for most data sets we have tried so far.

TABLE 1
Parameters and Their Typical Values

Fig. 24. Synthesized results of the toy ball sequence. Note that the input data set does not contain global time stamps. (a) Traditional LFR with unsynchronized frames. (b) ST-LFR, using estimated temporal offsets among input sequences.
Regarding speed, our current implementation is not yet real time. Synthesizing a single frame takes a few minutes on average, depending on the number of detected features. The main bottleneck is the temporal interpolation step. Forming the matching edge set also requires a fairly exhaustive search, even after some acceleration structures are adopted.
9 CONCLUSION AND FUTURE WORK
In this paper, we extend the traditional LFR to the temporal
domain to accommodate dynamic scenes. Instead of capturing the dynamic scene in strict synchronization and treating
each image set as an independent static light field, our notion
of a space-time light field simply assumes a collection of video
sequences. These sequences may or may not be synchronized,
and they can have different capture rates.
In order to be able to synthesize novel views from any
viewpoint at any time instant, we developed a two-stage
rendering algorithm: 1) temporal interpolation that registers
successive frames with spatial-temporal optical flow and
generates synchronized frames using an edge-guided
image-morphing algorithm that preserves important edge
features and 2) spatial interpolation that uses ULR to create
the final view from a given viewpoint. Experimental results
have shown that our approach is robust and capable of
maintaining photorealistic results comparable to traditional
static LFR even when the dynamic input is not synchronized. Furthermore, by allowing blending over the entire
space-time light field space, we demonstrated novel visual effects such as the visualization of a motion trajectory in 3D,
which cannot be produced with traditional LFR.
We also present a technique to find the temporal
alignment between sequences, in cases when the global
time stamp is not available. Our approach is very accurate
and extremely robust. It can estimate a temporal offset with
subframe accuracy even when a very large portion (up to
90 percent) of feature matching is incorrect. Our technique
can be applied in many other areas, such as stereo vision
and multiview tracking, to enable the use of unsynchronized video input.
Looking into the future, we plan to investigate a more efficient interpolation method so that we can render a space-time light field in real time and online, without significantly sacrificing image quality. This real-time capability, together
with the use of inexpensive Web cameras, can contribute to a
wider diffusion of light field techniques into many interesting
applications, such as 3D video teleconferencing, remote
surveillance, and telemedicine.
ACKNOWLEDGMENTS
The authors would like to thank Greg Turk for suggestions
and Sifang Li for participation in the experiments. This
work was done while the authors were at the University of
Kentucky and was supported in part by the University of
Kentucky Research Foundation, the US Department of
Homeland Security, and US National Science Foundation
Grant IIS-0448185.
REFERENCES
[1] M. Levoy and P. Hanrahan, “Light Field Rendering,” Proc. Int’l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH ’96), pp. 31-42, Aug. 1996.
[2] S.J. Gortler, R. Grzeszczuk, R. Szeliski, and M.F. Cohen, “The Lumigraph,” Proc. Int’l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH ’96), pp. 43-54, Aug. 1996.
[3] T. Naemura, J. Tago, and H. Harashima, “Realtime Video-Based Modeling and Rendering of 3D Scenes,” IEEE Computer Graphics and Applications, vol. 22, no. 2, pp. 66-73, 2002.
[4] J.C. Yang, M. Everett, C. Buehler, and L. McMillan, “A Real-Time Distributed Light Field Camera,” Proc. 13th Eurographics Workshop Rendering, pp. 77-86, 2002.
[5] B. Wilburn, M. Smulski, H. Lee, and M. Horowitz, “The Light Field Video Camera,” Media Processors 2002, Proc. SPIE, vol. 4674, pp. 29-36, 2002.
[6] W. Matusik and H. Pfister, “3D TV: A Scalable System for Real-Time Acquisition, Transmission, and Autostereoscopic Display of Dynamic Scenes,” Proc. Int’l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH ’04), pp. 814-824, 2004.
[7] B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy, “High Performance Imaging Using Large Camera Arrays,” ACM Trans. Graphics, vol. 24, no. 3, pp. 765-776, 2005.
[8] E. Shechtman, Y. Caspi, and M. Irani, “Increasing Space-Time Resolution in Video,” Proc. Seventh European Conf. Computer Vision (ECCV ’02), pp. 753-768, 2002.
[9] S. Vedula, S. Baker, and T. Kanade, “Image-Based Spatio-Temporal Modeling and View Interpolation of Dynamic Events,” ACM Trans. Graphics, vol. 24, no. 2, pp. 240-261, 2005.
[10] T. Beier and S. Neely, “Feature Based Image Metamorphosis,”
Proc. Int’l Conf. Computer Graphics and Interactive Techniques
(SIGGRAPH ’92), pp. 35-42, 1992.
[11] E.H. Adelson and J. Bergen, “The Plenoptic Function and the
Elements of Early Vision,” Computational Models of Visual Processing, pp. 3-20, Aug. 1991.
[12] H.Y. Shum and L.W. He, “Rendering with Concentric Mosaics,”
Proc. Int’l Conf. Computer Graphics and Interactive Techniques
(SIGGRAPH ’97), pp. 299-306, 1997.
[13] I. Ihm, S. Park, and R.K. Lee, “Rendering of Spherical Light
Fields,” Pacific Graphics, 1998.
[14] C. Zhang and T. Chen, “A Self-Reconfigurable Camera Array,”
Proc. Eurographics Symp. Rendering, pp. 243-254, 2004.
[15] T. Kanade, P.W. Rander, and P.J. Narayanan, “Virtualized Reality:
Constructing Virtual Worlds from Real Scenes,” IEEE MultiMedia
Magazine, vol. 1, no. 1, pp. 34-47, 1997.
[16] W. Matusik, C. Buehler, R. Raskar, S. Gortler, and L. McMillan,
“Image-Based Visual Hulls,” Proc. Int’l Conf. Computer Graphics
and Interactive Techniques (SIGGRAPH ’00), pp. 369-374, Aug. 2000.
[17] R. Yang, G. Welch, and G. Bishop, “Real-Time Consensus-Based
Scene Reconstruction Using Commodity Graphics Hardware,”
Proc. Pacific Graphics Conf., pp. 225-234, Oct. 2002.
[18] C. Zitnick, S.B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski,
“High-Quality Video View Interpolation Using a Layered Representation,” ACM Trans. Graphics, vol. 23, no. 3, pp. 600-608, 2004.
[19] Y. Caspi and M. Irani, “A Step Towards Sequence to Sequence
Alignment,” Proc. Conf. Computer Vision and Pattern Recognition
(CVPR ’00), pp. 682-689, 2000.
[20] Y. Caspi and M. Irani, “Alignment of Non-Overlapping Sequences,” Proc. Int’l Conf. Computer Vision (ICCV ’01), pp. 76-83,
2001.
[21] C. Rao, A. Gritai, M. Shah, and T. Syeda-Mahmood, “View-Invariant Alignment and Matching of Video Sequences,” Proc.
Int’l Conf. Computer Vision (ICCV ’03), pp. 939-945, 2003.
[22] P. Sand and S. Teller, “Video Matching,” ACM Trans. Graphics,
vol. 23, no. 3, pp. 592-599, 2004.
[23] C. Zhou and H. Tao, “Depth Computation of Dynamic Scenes
Using Unsynchronized Video Streams,” Proc. Conf. Computer
Vision and Pattern Recognition (CVPR ’03), pp. 351-358, 2003.
[24] S. Borman and R. Stevenson, “Spatial Resolution Enhancement of
Low-Resolution Image Sequences—A Comprehensive Review
with Directions for Future Research,” technical report, Laboratory
for Image and Signal Analysis (LISA), Univ. of Notre Dame, 1998.
[25] J.R. Bergen, P. Anandan, K.J. Hanna, and R. Hingorani,
“Hierarchical Model-Based Motion Estimation,” Proc. Second
European Conf. Computer Vision (ECCV ’92), pp. 237-252, 1992.
[26] J. Barron, D.J. Fleet, and S.S. Beauchemin, “Performance of Optical
Flow Techniques,” Int’l J. Computer Vision, vol. 12, no. 1, pp. 43-77,
1994.
[27] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade,
“Three-Dimensional Scene Flow,” Proc. Int’l Conf. Computer
Vision (ICCV ’99), vol. 2, pp. 722-729, 1999.
[28] K. Kutulakos and S.M. Seitz, “A Theory of Shape by Space
Carving,” Int’l J. Computer Vision (IJCV), vol. 38, no. 3, pp. 199-218,
2000.
[29] H. Wang and R. Yang, “Towards Space-Time Light Field
Rendering,” Proc. 2005 Symp. Interactive 3D Graphics and Games
(I3D ’05), pp. 125-132, 2005.
[30] C.J. Harris and M. Stephens, “A Combined Corner and Edge
Detector,” Proc. Fourth Alvey Vision Conf., pp. 147-151, 1988.
[31] J.-Y. Bouguet, “Pyramidal Implementation of the Lucas Kanade Feature Tracker: Description of the Algorithm,” technical report, 1999.
[32] B.D. Lucas and T. Kanade, “An Iterative Image Registration
Technique with an Application to Stereo Vision,” Proc. Int’l Joint
Conf. Artificial Intelligence, pp. 674-679, 1981.
[33] C. Tomasi and T. Kanade, “Detection and Tracking of Point
Features,” Technical Report CMU-CS-91-132, Carnegie Mellon
Univ., 1991.
[34] L. Zhang, B. Curless, and S. Seitz, “Spacetime Stereo: Shape
Recovery for Dynamic Scenes,” Proc. Conf. Computer Vision and
Pattern Recognition (CVPR ’03), pp. 367-374, 2003.
[35] J. Canny, “A Computational Approach to Edge Detection,” IEEE
Trans. Pattern Analysis and Machine Intelligence, vol. 8, no. 6,
pp. 679-698, 1986.
[36] J.R. Shewchuk, “Triangle: Engineering a 2D Quality Mesh
Generator and Delaunay Triangulator,” Applied Computational
Geometry: Towards Geometric Eng., LNCS 1148, pp. 203-222, 1996.
[37] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen,
“Unstructured Lumigraph Rendering,” Proc. Int’l Conf. Computer
Graphics and Interactive Techniques (SIGGRAPH ’01), pp. 425-432,
Aug. 2001.
[38] Point Grey Research, http://www.ptgrey.com, 2007.
[39] M.A. Fischler and R.C. Bolles, “Random Sample Consensus: A
Paradigm for Model Fitting with Applications to Image Analysis
and Automated Cartography,” Comm. ACM, vol. 24, pp. 381-395,
1981.
[40] J.-X. Chai, X. Tong, S.-C. Chan, and H.-Y. Shum, “Plenoptic
Sampling,” Proc. Int’l Conf. Computer Graphics and Interactive
Techniques (SIGGRAPH ’00), pp. 307-318, 2000.
[41] S. Baker, T. Sim, and T. Kanade, “When Is the Shape of a Scene
Unique Given Its Light-Field: A Fundamental Theorem of 3D
Vision,” IEEE Trans. Pattern Analysis and Machine Intelligence
(PAMI), vol. 25, no. 1, pp. 100-109, 2003.
Huamin Wang received the BS degree from
Zhejiang University, China, in 2002 and the MS
degree in computer science from Stanford
University in 2004. He is now a PhD candidate
in computer science at the Georgia Institute of
Technology, working under the direction of
Professor Greg Turk. His current work is focused
on simulating small-scale fluid effects, texture
synthesis, and light field rendering. He is a
recipient of an NVIDIA fellowship in 2006–2007
and a student member of the IEEE Computer Society and the ACM.
Mingxuan Sun received the BS degree from
Zhejiang University, China, in 2004 and the MS
degree in computer science from the University
of Kentucky in 2006. She is now a PhD student
in computer science at the Georgia Institute of
Technology. Her current work is focused on
shading correction and light field rendering. Prior
to Georgia Tech, she also worked in the Center for Visualization and Virtual Environments at the University of Kentucky. She is a
student member of the IEEE.
Ruigang Yang received the MS degree in
computer science from Columbia University in
1998 and the PhD degree in computer science
from the University of North Carolina at Chapel
Hill in 2003. He is an assistant professor in the
Computer Science Department at the University
of Kentucky. His research interests include
computer graphics, computer vision, and multimedia. He is a recipient of the US NSF
CAREER award in 2004 and a member of the
IEEE and the ACM.