IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 13, NO. 4, JULY/AUGUST 2007

Space-Time Light Field Rendering

Huamin Wang, Student Member, IEEE Computer Society, Mingxuan Sun, Student Member, IEEE, and Ruigang Yang, Member, IEEE

Abstract—In this paper, we propose a novel framework called space-time light field rendering, which allows continuous exploration of a dynamic scene in both space and time. Compared to existing light field capture/rendering systems, it offers the capability of using unsynchronized video inputs and the added freedom of controlling the visualization in the temporal domain, such as smooth slow motion and temporal integration. In order to synthesize novel views from any viewpoint at any time instant, we develop a two-stage rendering algorithm. We first interpolate in the temporal domain to generate globally synchronized images, using a robust spatial-temporal image registration algorithm followed by edge-preserving image morphing. We then interpolate these software-synchronized images in the spatial domain to synthesize the final view. In addition, we introduce a very accurate and robust algorithm to estimate subframe temporal offsets among input video sequences. Experimental results from unsynchronized videos with or without time stamps show that our approach is capable of maintaining photorealistic quality for a variety of real scenes.

Index Terms—Image-based rendering, space-time light field, epipolar constraint, image morphing.

H. Wang and M. Sun are with the Georgia Institute of Technology, Tech Square Research Building, 85 5th Street NW, Atlanta, GA 30332-0760. E-mail: {whmin, cynthia}@cc.gatech.edu.
R. Yang is with the Department of Computer Science, University of Kentucky, One Quality Street, Suite 800, Lexington, KY 40507. E-mail: ryang@cs.uky.edu.
Manuscript received 7 June 2006; accepted 25 Oct. 2006; published online 17 Jan. 2007. Digital Object Identifier no. 10.1109/TVCG.2007.1019.

1 INTRODUCTION

During the last few years, image-based modeling and rendering (IBMR) has become a popular alternative for synthesizing photorealistic images of real scenes whose complexity and subtleties are difficult to capture with traditional geometry-based techniques. Among various IBMR approaches, light field rendering (LFR) [1], [2] is a representative one that uses many images captured from different viewpoints of the scene of interest, that is, a light field. Given a light field, the rendering process becomes a simple operation of rearranging recorded pixel intensities. The process can, in theory, guarantee rendering quality for any type of scene without resorting to difficult and often fragile depth reconstruction approaches.

With technological advancement, it is now possible to build camera arrays to capture a dynamic light field for moving scenes [3], [4], [5], [6]. However, most existing camera systems rely on synchronized input: Each set of images taken at the same instant is treated as a separate light field. Although a viewer can continuously change the viewpoint in the spatial domain within each static light field, exploration in the temporal domain is still discrete. The only exception we have found so far is the high-performance light field array from Stanford [7], in which customized cameras are triggered in a staggered pattern to increase the temporal sampling rate. The triggering pattern is carefully designed to accommodate the expected object motion.
In this paper, we present the notion of space-time light field rendering (ST-LFR), which allows continuous exploration of a dynamic light field in both the spatial and the temporal domain. A space-time light field is defined as a collection of video sequences captured by a set of tightly packed cameras that may or may not be synchronized. The basic goal of ST-LFR is to synthesize novel images from an arbitrary viewpoint at an arbitrary time instant t. Traditional LFR is therefore a special case of ST-LFR at some fixed time instant.

One may think that, given a sufficiently high capture rate (that is, the full National TV Standards Committee (NTSC) rate of 30 fps), applying traditional LFR to the set of images captured closest to t could generate reasonable results. Unfortunately, a back-of-the-envelope calculation shows that this is usually not true: Typical body motion (approximately 0.6 m/s) at sitting distance can cause an offset of more than 10 pixels per frame in a 30 fps VGA sequence. As shown in Fig. 1b, the misalignment caused by the subject's casual (hand) motion is quite evident in the synthesized image even when the temporal offset is less than half a frame time, that is, 1/60 sec.

Although the idea of spatial-temporal view synthesis has been explored in the computer vision literature with sparse camera arrays (for example, [8] and [9]), we believe that this is the first time such a goal has been articulated in the context of LFR. Using many images from a densely packed camera array ameliorates the difficulty of 3D reconstruction. Nevertheless, we still have to deal with the fundamental correspondence problem. Toward this end, we adopt the following guidelines: 1) we only need to find matches for salient features that are both critical for view synthesis and easy to detect, and 2) we cannot guarantee that automatically found correspondences are correct; therefore, we seek ways to "hide" these mismatches during the final view synthesis. Based on these two guidelines, we are able to synthesize photorealistic results without completely solving the 3D reconstruction problem.

Our approach to ST-LFR can be decomposed into two stages. First, we interpolate the original video sequences in the temporal domain to synthesize images for a given time instant. With these virtually synchronized images, we then use traditional LFR techniques to render novel views in the spatial domain. The major technical difficulty of ST-LFR is in the first step, synthesizing "in-between" frames within each sequence, toward which we made the following contributions:

- A robust spatial-temporal image registration algorithm using optical flow and the epipolar constraint. The number of mismatches can be greatly reduced after enforcing both intersequence and intrasequence matching consistency. Furthermore, the spatial correspondences can be used to construct a 3D geometric proxy that improves the depth of field in LFR.
- A robust temporal image interpolation algorithm that provides superior results in the presence of inevitable mismatches. Our algorithm, based on image morphing [10], incorporates a varying weight term using the image gradient to preserve edges.
- A novel procedure to estimate temporal offsets among input video sequences. It is very robust and accurate: Our test using video sequences with ground-truth time stamps shows that the error in the temporal offset is usually less than 2 ms, or 1/16 of a frame time (captured at 30 fps).

Fig. 1 compares results from our ST-LFR approach with traditional LFR. Note that ST-LFR not only can synthesize realistic images for any given time instant, but also can visualize smooth motion trajectories by integrating images in the temporal domain. The improvements are much more obvious in the video at http://vis.uky.edu/~gravity/Research/st-lfr/st-lfr.htm.
Fig. 1. (a) A camera array we used to capture video sequences. The cameras run at 30 fps without intercamera synchronization (that is, genlock). (b) Pictures synthesized using traditional LFR. (c) Pictures synthesized using ST-LFR. The bottom pictures show the visualization of a motion trajectory. We overlay the images of each sequence, which is similar to temporal integration, and use these integrated images as input for LFR. The trajectories (for example, from the highlights) in (c) are much smoother since in-between frames are accurately synthesized and used for the temporally integrated images.

2 RELATED WORK

2.1 Light-Field-Based Rendering (LFR)

The class of light-field-based rendering techniques is formulated around the plenoptic function, which describes all of the radiant energy that can be perceived by an observer at any point in space and time [11]. Levoy and Hanrahan [1] pointed out that the plenoptic function can be simplified in regions of free space (that is, space free of occluders), since radiant energy does not change along a ray in free space. The reduced function (the so-called light field) can be interpreted as a function on the space of oriented light rays. The light field can be directly recorded using multiple cameras. Once it is acquired, LFR becomes a simple task of table lookups to retrieve the recorded radiant intensities.

A major disadvantage of light-field-based techniques is the huge amount of data required; therefore, their application was traditionally limited to static scenes [1], [2], [12], [13]. The acquisition of dense dynamic light fields has become feasible only recently with technological advancement. Most systems use a dense array of synchronized cameras to acquire dynamic scenes at discrete temporal intervals [3], [4], [5], [6]. Images at each time instant are collectively treated as an independent light field. As a result, viewers can explore continuously only in the spatial domain but not in the temporal domain. To create the impression of dynamic events, the capture frame rate has to be sufficiently high, further increasing the demand for data storage and processing. In addition, hardware camera synchronization is a feature available only on high-end cameras, which significantly increases the system cost. Without synchronization, misalignment problems occur in fast-motion scenarios, as reported in one of the few light field systems built with low-cost commodity Web cameras [14].

Just recently, Wilburn et al. [7] presented a light field camera array using custom-made hardware. Many interesting effects are demonstrated, such as seeing through occluders and superresolution imagery.
Particularly relevant to our work is their use of staggered triggering to increase the temporal resolution. All cameras are synchronized under a global clock, but each is set to a potentially different shutter delay. By combining images taken at slightly different times, the synthesized images can reproduce smooth temporal motion as if taken by a high-speed camera. In order to reduce ghosting effects, images are warped based on optical flow, which we discuss further later. From a system standpoint, our ST-LFR framework is more flexible since it does not require a known temporal offset.

2.2 View Synthesis in Computer Vision

The task of view synthesis, that is, generating images from a novel viewpoint given a sparse set of camera images, is an active research topic in computer vision. Many of these techniques focus on dense geometric reconstruction of the scene (for example, [15], [16], [17], and [18]). Although some very impressive results have been obtained, the problem of depth reconstruction remains open. In addition, synchronized camera input is typically required. With unsynchronized input, there are techniques to find the temporal offset [19], [20], [21], [22], [23], but most of them do not provide subframe accuracy, that is, the alignment resolution is one frame at best, which, as the simple calculation in Section 1 shows, is not adequate for view synthesis. Although the technique proposed in [23] achieves subframe accuracy, it requires a regular capture rate and a minimum of four frames as input. Our method requires only three frames, captured at any rate, and is extremely robust in the presence of feature-matching outliers. Furthermore, most of these alignment techniques do not address the problem of synthesizing "in-between" frames, a critical part of ST-LFR.

Our work is also related to superresolution techniques in computer vision. Although most approaches focus on increasing the spatial resolution (see [24] for a comprehensive review), a notable exception [8] extends the notion of superresolution to the space-time domain. Their approach enables new visual capabilities for dynamic events, such as the removal of motion blur and temporal antialiasing. However, they assume that all images have already been registered in the space-time volume and that the scene is approximately planar. Our ST-LFR formulation does not require a priori image registration and allows arbitrary scene and motion types. On the other hand, we do not attempt to increase the spatial resolution.

One key component of ST-LFR is finding correspondences among images, which is achieved through optical flow computation over both space and time. Optical flow is a well-studied problem in computer vision; interested readers are referred to [25] and [26] for reviews and comparisons of different optical flow techniques. Although flow algorithms typically use only a single sequence of frames (two frames in most cases), there are a few extensions to the spatial-temporal domain. Vedula et al. introduced the concept of scene flow [27] to perform spatial-temporal voxel-based space carving [28] from a sparse image set. With the voxel model, they perform ray casting across space and time to generate novel views [9]. In the context of LFR, Wilburn et al. [7] modified the original optical flow formulation by iteratively warping the nearest four captured images across space and time toward the virtual view.
Their goal of synthesizing ghosting-free views from images captured at different times is similar to ours. (Note that [7] is in fact a concurrent development to ST-LFR; a preliminary version of this paper was published in March 2005 [29].) It is also acknowledged in [7] that the results suffer from the usual artifacts of optical flow. We go one step further: Instead of just warping along the optical flow, we develop a novel edge-guided temporal interpolation as a remedy for optical flow errors, thus increasing the robustness of view synthesis.

3 ALGORITHM OVERVIEW

Given a set of video sequences captured by a calibrated camera array, our goal is to construct a space-time light field so that dynamic scenes can be rendered at an arbitrary time from arbitrary viewing directions. We do not require the video cameras to be synchronized; in fact, they do not even need to share the same capture rate. We have developed a set of techniques to establish spatial-temporal correspondences across all images so that view synthesis can proceed. More specifically, our approach to ST-LFR has the following major steps, as shown in Fig. 2:
1. image registration,
2. temporal offset estimation (optional),
3. synchronized image generation (temporal interpolation), and
4. LFR (spatial interpolation).

Fig. 2. The structure of our ST-LFR algorithm.

In the first step (Section 4), we use a robust spatial-temporal optical flow algorithm to establish feature correspondences among successive frames of each camera's video sequence. In cases where the video sequences are not synchronized or the relative temporal offsets are unknown, we have developed a simple but robust technique to estimate the offsets (presented in Section 7), which in turn reduces mismatches. Once feature correspondences are established, new globally synchronized images are synthesized using a novel edge-guided image-morphing method (Section 5). The synchronized video sequences are eventually used as inputs to synthesize images in the spatial domain using traditional LFR techniques (Section 6).

4 IMAGE REGISTRATION

Image registration in this context means finding pairwise feature correspondences among the input videos. It is the most critical part of the entire ST-LFR pipeline. We first discuss how to find feature correspondences between two frames using optical flow and then show how the epipolar constraint can be used to detect feature correspondence errors once the temporal offset between two frames is known (Section 7). Throughout the paper, we use C_i to denote a camera, where the subscript is the camera index, and I_i(t) to denote an image taken by camera C_i at time t. Note that t does not have to be based on a real clock; we typically use one frame time (that is, 1/30 sec) as the unit of time.

4.1 Bidirectional Flow Computation

Given two images I_i(t_1) and I_j(t_2), we calculate optical flows in both directions: forward from I_i(t_1) to I_j(t_2) and backward from I_j(t_2) to I_i(t_1). Note that the two frames can be captured either by the same camera at different times, resulting in a temporal flow, or by two different cameras, resulting in a spatial flow.
The bidirectional flow computation provides more feature correspondences than a one-way flow. To acquire the forward flow, we first select corners with large eigenvalues on I_i(t_1) as feature points using a Harris corner detector [30]. We enforce a minimum distance between any two feature points to prevent them from gathering in a small high-gradient region. We then calculate the sparse optical flow from I_i(t_1) to I_j(t_2) using the tracking functions in the Open Source Computer Vision Library (OpenCV) [31], which are based on the classic Kanade-Lucas-Tomasi (KLT) feature tracker [32], [33]. Similarly, to get the backward flow from I_j(t_2) to I_i(t_1), we first find features on image I_j(t_2) and then calculate the optical flow from I_j(t_2) to I_i(t_1). Redundant feature correspondences from the two flow computations are removed once they are found to be too close to each other.

We choose not to calculate a dense per-pixel flow since there are certain areas, such as occlusion boundaries and textureless regions, where flow computation is known to be problematic. Our sparse flow formulation is not only more robust, but also more computationally efficient. Moreover, since our temporal interpolation algorithm is based on salient edge features on boundaries, which are mostly covered by our sparse flows, occluded and textureless regions can still be correctly synthesized even if no feature correspondences are calculated within them.

4.2 Outlier Detection Using the Epipolar Constraint

Given two frames I_i(t_1) and I_j(t_2) (i ≠ j), we can use the epipolar constraint to detect and later remove wrong feature correspondences. The epipolar constraint dictates that, for any 3D point P in the world space, P's homogeneous projection locations p_i and p_j in the two cameras must satisfy
$p_i^{T} F p_j = 0, \qquad (1)$
where F is the fundamental matrix, which can be determined from the camera projection matrices. If we consider this equation solely in 2D image space, then p_i^T F can be interpreted as a 2D epipolar line in camera C_j, and (1) means that p_j must lie on the epipolar line p_i^T F, and vice versa.

When I_i(t_1) and I_j(t_2) are captured at the same time (that is, t_1 = t_2), we assume that the feature correspondences p_i and p_j are projections of the same 3D point, so we can directly use (1) to judge whether the correspondence is correct. Due to various error sources, such as inaccurate camera calibration and camera noise, we set an error threshold ε. If |p_i^T F p_j| > ε, we consider the feature correspondence to be wrong and remove it from our correspondence list.

When I_i(t_1) and I_j(t_2) are captured at different but known time instants, we can use linear interpolation to estimate the feature location over time, assuming the motion is temporally linear. As shown in Fig. 3, given the temporal correspondence p_i(t_1) and p_i(t_1 + Δt), p_i's location at t_2 (t_1 < t_2 < t_1 + Δt) is
$p_i(t_2) = \frac{t_1 + \Delta t - t_2}{\Delta t}\, p_i(t_1) + \frac{t_2 - t_1}{\Delta t}\, p_i(t_1 + \Delta t). \qquad (2)$

Fig. 3. The epipolar constraint is applied to detect bad matches in the spatial-temporal correspondences, assuming an unsynchronized input.

Using the interpolated p_i(t_2), the spatial correspondence between p_i(t_1) and p_j(t_2) can be verified as before:
$\left| p_i(t_2)^{T} F\, p_j(t_2) \right| < \varepsilon. \qquad (3)$
If a pair of interpolated spatial correspondences does not satisfy the epipolar constraint, then either the temporal correspondence or the spatial correspondence, or both, might be wrong. Since we cannot tell which one may be correct, we discard both. In this way, the epipolar constraint is used to detect not only spatial correspondence errors, but also temporal correspondence errors.
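The registration step above maps directly onto standard library calls. The following is a minimal sketch, not the authors' implementation: it tracks Harris corners in one direction with OpenCV's pyramidal KLT tracker and applies the epipolar test of (1)-(3) to a match between unsynchronized frames. The function names, tracker settings, and the pixel threshold eps are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of one direction of the sparse flow of
# Section 4.1 and the epipolar outlier test of Section 4.2, using OpenCV and NumPy.
import cv2
import numpy as np

def sparse_flow(img_a, img_b, max_corners=500, min_dist=10):
    """Track Harris corners from img_a to img_b with the pyramidal KLT tracker."""
    pts_a = cv2.goodFeaturesToTrack(img_a, maxCorners=max_corners, qualityLevel=0.01,
                                    minDistance=min_dist, useHarrisDetector=True)
    pts_b, status, _ = cv2.calcOpticalFlowPyrLK(img_a, img_b, pts_a, None,
                                                winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1
    return pts_a.reshape(-1, 2)[ok], pts_b.reshape(-1, 2)[ok]

def epipolar_distance(p_i, p_j, F):
    """Pixel distance from p_j to the epipolar line p_i^T F in camera C_j (Eq. (1))."""
    a, b = np.append(p_i, 1.0), np.append(p_j, 1.0)
    line = a @ F                                   # epipolar line in image j
    return abs(line @ b) / np.hypot(line[0], line[1])

def keep_unsynchronized_match(p_i_t1, p_i_t1dt, p_j_t2, F, t1, t2, dt, eps=3.0):
    """Interpolate p_i to time t2 (Eq. (2)) and apply the threshold test (Eq. (3))."""
    w = (t2 - t1) / dt                             # linear-motion assumption
    p_i_t2 = (1.0 - w) * p_i_t1 + w * p_i_t1dt
    return epipolar_distance(p_i_t2, p_j_t2, F) < eps
```

The backward flow is obtained by swapping the two images, and matches failing the test are removed from both the temporal and spatial correspondence lists, as described above.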
This error detection method assumes that the motion pattern is roughly linear in the projective space. Many real-world movements, including rotation, do not strictly satisfy this requirement. Fortunately, when the cameras have a sufficiently high frame rate with respect to the 3D motion, such a locally temporal linearization is generally acceptable [34]. Our experimental results also support this assumption. In Fig. 4, correct feature correspondences satisfy the epipolar constraint well even though the magazine was rotating fast. Fig. 4 also demonstrates the amount of pixel offset that casual motion can introduce: In two successive frames captured at 30 fps, many feature points move more than 20 pixels, a substantial amount that cannot be ignored in view synthesis.

Fig. 4. Feature points and epipolar line constraints. Red labels indicate wrong correspondences, whereas blue labels indicate right correspondences. (a) p_i(t_1) on image I_i(t_1). (b) p_i(t_1 + Δt) on I_i(t_1 + Δt). (c) p_j(t_2) and p_i(t_2)'s epipolar lines on image I_j(t_2).

4.3 Detecting Correspondence Errors in Multicamera Systems

As discussed above, we need three frames in order to detect bad matches. Given a temporal pair I_i(t_1) and I_i(t_1 + Δt), we need to select a reference camera j that captures I_j(t_2) with t_1 ≤ t_2 ≤ t_1 + Δt. Ideally, the more reference cameras we use, the more errors we can detect. However, the number of false negatives increases with the number of reference cameras as well; for example, correct temporal correspondences may be removed because of wrong spatial correspondences caused by reference cameras. Therefore, we need a selection scheme that chooses good reference frames in order to remove correspondence errors without causing too many false negatives.

Reference cameras spatially close to the original camera (that is, camera i) in the camera array are good candidates since they cause fewer spatial correspondence errors due to occlusions. The ambiguity along the epipolar line should also be taken into consideration: When a feature is moving horizontally, cameras on the same horizontal line should not be used since the feature stays almost on the horizontal epipolar line. In addition, for camera arrays built on regular grids, it is not necessary to use more cameras on the same column or row because they share similar epipolar lines.

Given a selected reference camera, in order to fully reveal correspondence errors calculated by the forward flow, we prefer reference frames with capture time t_2 close to t_1 + Δt, because the flow is calculated from known feature locations on I_i(t_1) to I_i(t_1 + Δt). Similarly, we prefer reference frames captured near t_1 when examining correspondences calculated by the backward flow.

In order to evaluate camera i's temporal correspondences, our reference camera selection scheme calculates a quality value w_j for each camera j using
$w_j = \alpha D(j) + |t_2 - t|, \qquad (4)$
where D(j) measures the spatial closeness between camera i and camera j, t_2 is the closest frame time to t in camera j's sequence, that is,
$t_2 = \arg\min_{t_2 \in \text{Sequence } j} |t_2 - t|, \qquad (5)$
and α is a parameter that balances the influence of spatial closeness and time difference.
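As a small illustration of (4) and (5), the scoring below assumes camera distances and frame time stamps are available as plain Python values; the function name and the convention of picking the camera with the smallest score are our reading of the "quality value," not code from the paper.

```python
# A sketch of the reference-camera scoring of Eqs. (4)-(5); alpha, the data layout,
# and the "smaller is better" convention are illustrative assumptions.
def reference_camera_score(dist_ij, frame_times_j, t, alpha=1.0):
    """w_j = alpha * D(j) + |t2 - t|, with t2 the closest frame time of camera j."""
    t2 = min(frame_times_j, key=lambda s: abs(s - t))        # Eq. (5)
    return alpha * dist_ij + abs(t2 - t), t2                 # Eq. (4)

# Usage (hypothetical names): pick the best reference camera j != i for the forward
# flow, evaluated at t = t1 + dt.
# best_j = min((j for j in range(num_cams) if j != i),
#              key=lambda j: reference_camera_score(D(i, j), times[j], t1 + dt)[0])
```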
Since we detect feature correspondence errors calculated from the backward flow and the forward flow separately, correspondences constructed by different flows use different reference frames and cameras as well; t is either t_1 or t_1 + Δt, depending on which flow is being examined. If the camera array is constructed on a regular grid, D(j) can simply be expressed as the distance between array indices. In that case, we choose one reference camera along the column and one along the row to provide both horizontal and vertical epipolar constraints. When the cameras share the same frame rate, the reference cameras only have to be selected once.

5 TEMPORAL INTERPOLATION WITH EDGE GUIDANCE

Once we have calculated the temporal correspondences between two successive frames from one camera, we would like to generate intermediate frames at any time for this camera so that synchronized video sequences can be produced for traditional LFR. One possible approach is to simply triangulate the whole image and blend the two images by texture mapping. However, the result quality is quite unpredictable since it depends heavily on the local triangulation of feature points: Even a single feature mismatch can affect a large neighboring region of the image.

Our robust temporal interpolation scheme, based on the image-morphing method [10], incorporates varying weights using the image gradient. In essence, line segments are formed automatically from point correspondences to cover the entire image. Segments aligned with image edges gain extra weight to preserve edge straightness, an important visual cue. Therefore, our method can synthesize smooth and visually appealing images even in the presence of missing or wrong feature correspondences.

5.1 Extracting Edge Correspondences

We first construct segments by extracting edge correspondences. Given a segment e_i(t) connecting two feature points on image I_i(t) and a segment e_i(t + Δt) connecting the corresponding feature points on I_i(t + Δt), we would like to know whether this segment pair (e_i(t), e_i(t + Δt)) is aligned with image edges in both frames, that is, whether it forms an edge correspondence. Our decision is based on two main factors: edge strength and edge length.

Regarding edge strength, our algorithm examines whether a potential edge matches both the image gradient magnitude and orientation, similar to the Canny edge detection method [35]. On a given segment, we select a set of samples s_1, s_2, ..., s_n uniformly and calculate the fitness function for the segment as
$f(e) = \min_{s_k \in e}\left( |\nabla I(s_k)| + \lambda \frac{\nabla I(s_k) \cdot N(e)}{|\nabla I(s_k)|} \right). \qquad (6)$
The first term in (6) gives the gradient magnitude, and the second term indicates how well the gradient orientation matches the edge normal direction N(e); λ is a parameter that balances the influence of the two terms. We calculate the fitness values for both e_i(t) and e_i(t + Δt). If both values are greater than a threshold, then both e_i(t) and e_i(t + Δt) lie in strong gradient regions and the gradient direction matches the edge normal; therefore, e_i(t) and e_i(t + Δt) may represent a real surface or texture boundary in the scene. Fig. 5 shows a gradient magnitude map and the detected feature edges.

Fig. 5. Real feature edges. (a) The gradient magnitude map. (b) Feature edges (blue) detected by testing feature points pairwise.
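A compact sketch of the fitness test in (6) is given below. It assumes gradient images gx and gy (for example, from cv2.Sobel) and treats the balance weight lam and the sample count as illustrative values; the absolute value on the dot product reflects that only the alignment between gradient and edge normal matters, not its sign.

```python
# A sketch of Eq. (6); gx, gy are image gradients, and lam, num_samples are
# illustrative parameters rather than the paper's exact settings.
import numpy as np

def edge_fitness(gx, gy, p0, p1, lam=1.0, num_samples=16, eps=1e-6):
    """f(e) = min over samples s_k of |grad I| + lam * (grad I . N(e)) / |grad I|."""
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    d = p1 - p0
    n = np.array([-d[1], d[0]]) / (np.linalg.norm(d) + eps)   # unit edge normal N(e)
    best = np.inf
    for s in np.linspace(0.0, 1.0, num_samples):
        x, y = (p0 + s * d).round().astype(int)               # sample point on the segment
        g = np.array([gx[y, x], gy[y, x]], float)
        mag = np.linalg.norm(g) + eps
        best = min(best, mag + lam * abs(g @ n) / mag)
    return best
```

A segment pair (e_i(t), e_i(t + Δt)) would be accepted as a real edge correspondence only if both fitness values exceed the chosen threshold.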
The second factor is the edge length. We assume that the edge length is nearly constant from frame to frame, provided that the object distortion is relatively slow compared with the frame rate. This assumption is used to discard edge correspondences if |e_i(t + Δt)| changes too much from |e_i(t)|. Most changes in edge length are caused by wrong temporal correspondences. Nevertheless, an edge length change may be caused by reasonable point correspondences as well. For instance, in Fig. 6, the dashed blue arrows show the intersection boundaries between the magazine and the wall. They are correctly tracked; however, since they do not represent real 3D points, they do not follow the magazine motion at all. Edges that end at those points can be detected from their length changes and removed to avoid distortions during temporal interpolation. Similarly, we do not consider segments connecting one static point (that is, a point that does not move between the two temporal frames) with one dynamic point, again because of the segment length change.

Fig. 6. The optical flow (a) before and (b) after applying the epipolar constraint. Correspondence errors on the top of the magazine are detected and removed.

A detail worth mentioning is that we include a maximum edge length constraint L_max in our implementation. This constraint has nothing to do with edge correspondence correctness; rather, it serves to accelerate the edge correspondence detection. Given a point feature p as an endpoint of a possible edge, the other endpoint must be contained in the circle centered at p with radius L_max. We build a kD-tree structure for the feature points to quickly discard any points outside of that circle. When L_max is relatively small compared with the frame resolution, the computational cost can be greatly reduced from O(n^2) to O(mn), where n is the total number of feature points and m is the average number of feature points clustered in circles of radius L_max.
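The L_max pruning can be implemented directly with a spatial index; the sketch below uses SciPy's cKDTree and is only meant to illustrate the acceleration, with L_max supplied by the caller.

```python
# A sketch of the L_max neighbor pruning described above, using scipy.spatial.cKDTree.
import numpy as np
from scipy.spatial import cKDTree

def candidate_edges(points, l_max):
    """Return index pairs (a, b) whose endpoints are at most L_max apart."""
    tree = cKDTree(np.asarray(points, float))
    return sorted(tree.query_pairs(r=l_max))   # only pairs inside each other's L_max circle
```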
We detect this by testing whether real feature edges are nearly parallel and adjacent. When detected, we keep only the longest one. 5.3 Morphing We extend the image-morphing method [10] to interpolate two frames temporally. We allow real edges to have more influence weights than virtual edges since real edges are believed to have physical meanings in the real world. In [10], the edge weight was calculated using the edge length L and the point-edge distance D: b L ; ð7Þ weight0 ¼ ða þ DÞ Fig. 9. Interpolation result (a) without and (b) with virtual feature edges. Fig. 9 shows the results without and with virtual edges. Fig. 10 shows the results using different interpolation schemes. Since some features are missing on the top of the magazine, the interpolation quality improves when real feature edges get extra weights according to (8). Even without additional weights, image morphing generates more visually pleasing images in the presence of bad feature correspondences. 6 VIEW SYNTHESIS (SPATIAL INTERPOLATION) Once we generate synchronized images for a given time instant, we can use traditional LFR techniques to synthesize views from novel viewpoints. To this end, we adopt the unstructured lumigraph rendering (ULR) technique [37], where a, b, and are constants to affect the line influence. We calculate the weight for both real and virtual feature edges using the formula above. The weight for the real edge is then further boosted as fðeÞ fmin ðeÞ ; ð8Þ weight ¼ weight0 1 þ fmax ðeÞ fmin ðeÞ where fðeÞ is the edge samples’ average fitness value from (6). fmin ðeÞ and fmax ðeÞ are the minimum and maximum fitness value among all real feature edges, respectively. is a parameter to scale the boosting effect exponentially. We temporally interpolate two frames using both the forward temporal flow from Ii ðtÞ to Ii ðt þ tÞ and backward temporal flow from Ii ðt þ tÞ to Ii ðtÞ. The final pixel intensity is calculated by cross dissolving: Pt0 ¼ Pforward t þ t t0 t0 t ; þ Pbackward t t ð9Þ where Pforward is the pixel color calculated only from frame Ii ðtÞ, and Pbackward is the pixel color calculated only from frame Ii ðt þ tÞ. This is because we have more confidence with Ii ðtÞ’s features in the forward flow and Ii ðt þ tÞ’s features in the backward flow. These features are selected directly on images. Fig. 8. Three feature points p0 , p1 , and p2 are selected on the same dark boundary, forming three edges e0 , e1 , and e2 . We only keep the longest one e2 , which represents the entire edge in the real world. Fig. 10. The interpolated result using different schemes. (a) Image morphing without epipolar test. (b) Direct image triangulation with epipolar test. (c) Image morphing with epipolar test, but without extra edge weights. (d) Image morphing with both epipolar test and extra edge weights. Authorized licensed use limited to: Georgia Institute of Technology. Downloaded on January 28, 2010 at 14:43 from IEEE Xplore. Restrictions apply. 704 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 13, NO. 4, JULY/AUGUST 2007 Fig. 12. An illustration of three possible cases for the temporal offset ðt~Þ between two frames. (a) shows a feature point in camera Cj at time t0 ; its epipolar line in Ci can intersect the feature trajectory (from t to t þ t) in Ci in three ways, as shown in (b). Fig. 11. A reconstruct 3D mesh proxy. (a) Oblique view. (b) Front view. which can utilize graphics texture hardware to blend appropriate pixels together from the nearest cameras in order to compose the desired image. 
6 VIEW SYNTHESIS (SPATIAL INTERPOLATION)

Once we have generated synchronized images for a given time instant, we can use traditional LFR techniques to synthesize views from novel viewpoints. To this end, we adopt the unstructured lumigraph rendering (ULR) technique [37], which can utilize graphics texture hardware to blend appropriate pixels from the nearest cameras in order to compose the desired image.

ULR requires an approximation of the scene (a geometric proxy) as an input. We implemented two options. The first is a 3D planar proxy whose depth can be interactively controlled by the user; its placement determines which region of the scene is in focus. The second option is a proxy reconstructed from the spatial correspondences. For unsynchronized input, the correspondences are interpolated to the same time instant, and the 3D positions can then be triangulated. In order to reconstruct a proxy that covers the entire image, we also estimate a backplane based on the reconstructed 3D vertices. The plane is positioned and oriented so that all the vertices are in front of it. Finally, all the vertices (including the four corners of the plane) are triangulated to create a 3D mesh. Fig. 11 shows an example mesh.

Fig. 11. A reconstructed 3D mesh proxy. (a) Oblique view. (b) Front view.

Whether to use a plane or a mesh proxy in rendering depends mainly on the camera configuration and the scene depth variation. As we will show later, in Section 8, the planar proxy works well for scenes with small depth variations. On the other hand, using a reconstructed mesh proxy improves both the depth of field and the 3D effect for oblique-angle viewing.

7 ESTIMATION OF TEMPORAL OFFSET

With unsynchronized input, the temporal offset between any two sequences is required in many parts of our system. This offset can be obtained in several ways. Some cameras, such as the ones from Point Grey Research [38], can provide a global time stamp for every frame. One can also insert a high-resolution clock into the scene as a synchronization object. If neither approach is applicable, we present here a robust technique to estimate the temporal offset in software.

Again, our technique is based on the epipolar constraint. We estimate the offset pairwise from the temporal correspondence (p_i(t), p_i(t + Δt)) and the spatial correspondence (p_i(t), p_j(t')) obtained from three frames. The temporal offset t̃ = t' − t satisfies
$p_i(t')^{T} F_{ij}\, p_j(t') = 0, \qquad p_i(t') = \frac{\Delta t - \tilde{t}}{\Delta t}\, p_i(t) + \frac{\tilde{t}}{\Delta t}\, p_i(t + \Delta t), \qquad (10)$
where F_ij is the fundamental matrix between C_i and C_j, and p_i(t') is the estimated feature location at time t', assuming locally linear motion. If all feature points are known, the system above can be organized as a single linear equation in the one unknown t̃. Geometrically, this equation finds the intersection between p_j(t')'s epipolar line and the straight line defined by p_i(t) and p_i(t + Δt). As shown in Fig. 12, if the intersection lies between p_i(t) and p_i(t + Δt), then 0 < t̃ < Δt; if the intersection lies before p_i(t), then t̃ < 0; and if the intersection lies beyond p_i(t + Δt), then t̃ > Δt.

Fig. 12. An illustration of the three possible cases for the temporal offset t̃ between two frames. (a) shows a feature point in camera C_j at time t'; its epipolar line in C_i can intersect the feature trajectory (from t to t + Δt) in C_i in three ways, as shown in (b).
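Solving (10) for a single feature is a one-dimensional linear problem; the sketch below expresses it with NumPy. The degenerate-case guard and the variable names are ours, not the paper's.

```python
# A sketch of solving Eq. (10) from one temporal/spatial correspondence triple.
import numpy as np

def offset_from_feature(p_i_t, p_i_tdt, p_j_tp, F_ij, dt=1.0):
    """Return t_tilde such that the interpolated p_i(t') lies on p_j(t')'s epipolar line."""
    a = np.append(p_i_t, 1.0)        # p_i(t)
    b = np.append(p_i_tdt, 1.0)      # p_i(t + dt)
    c = np.append(p_j_tp, 1.0)       # p_j(t')
    ra = a @ F_ij @ c                # epipolar residual at s = 0
    rb = b @ F_ij @ c                # epipolar residual at s = 1
    if abs(ra - rb) < 1e-9:          # motion parallel to the epipolar line: degenerate case
        return None
    s = ra / (ra - rb)               # interpolation parameter; s < 0 or s > 1 correspond
    return s * dt                    # to the other two cases of Fig. 12
```

In practice, many such per-feature estimates are collected and fed into the RANSAC and WLS steps described next.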
Ideally, given frames I_i(t) and I_j(t'), we could calculate the offset t̃ from just a single feature using (10). Unfortunately, the flow computation is not always correct in practice. To provide a more robust and accurate estimate of the temporal offset, our method proceeds in four steps:
1. select salient features on image I_i(t) and calculate the temporal flow from I_i(t) to I_i(t + Δt) and the spatial flow from I_i(t) to I_j(t') using the technique described in Section 4.1,
2. calculate a time offset t̃[1], t̃[2], ..., t̃[N] for each feature P[1], P[2], ..., P[N] using (10),
3. detect and remove time offset outliers using the Random Sample Consensus (RANSAC) algorithm [39], assuming that outliers are primarily caused by random optical flow errors, and
4. calculate the final time offset using a Weighted Least Squares (WLS) method, given the remaining inliers from RANSAC.

We enumerate all possible offsets (each feature correspondence gives one offset estimate) as offset candidates in RANSAC. For each candidate, we determine the inliers as the consensus subset of estimated offsets that lie within a tolerance threshold d of the candidate offset. d is typically 0.05Δt in our experiments, where Δt is usually one frame time, that is, 1/30 sec. We then treat the largest consensus subset as the inliers among the initial offsets.

Even after we remove outliers using RANSAC, the offsets computed from the inliers can still vary because of various errors. Therefore, we further refine the temporal offset estimate from the remaining inliers using a WLS approach. The cost function is defined as
$C = \sum_{k=1}^{M} w_k (\tilde{t} - \tilde{t}_k)^2, \qquad (11)$
where M is the total number of remaining inliers, and the weight factor w_k is
$w_k = e^{-\beta |\tilde{t} - \tilde{t}_k|}, \quad \beta \ge 0. \qquad (12)$
The weight factor is sensitive when β is large; otherwise, the formulation approaches a standard least-squares fit as β goes to zero. Setting the derivative of (11) to zero turns the estimate into a weighted average:
$\frac{dC}{d\tilde{t}} = 2\sum_{k=1}^{M} w_k (\tilde{t} - \tilde{t}_k) = 0 \;\;\Rightarrow\;\; \tilde{t} = \sum_{k=1}^{M} w_k \tilde{t}_k \Big/ \sum_{k=1}^{M} w_k. \qquad (13)$
We first calculate the mean of all the inlier offsets as the initial offset. Then, during each regression step, we recalculate the weight factor for each inlier and then the weighted average offset t̃. Since RANSAC has already removed most outliers, the WLS fitting converges quickly.
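The RANSAC consensus over the one-dimensional offset candidates and the iterated weighted average of (13) can be sketched as follows; the tolerance d, the decay rate beta of (12), and the iteration count are illustrative values, not the paper's settings.

```python
# A sketch of the RANSAC + WLS refinement of Eqs. (11)-(13) over per-feature offsets.
import numpy as np

def estimate_offset(offsets, d=0.05, beta=4.0, iters=10):
    offsets = np.asarray(offsets, float)
    # RANSAC: every per-feature estimate is a candidate; keep the largest consensus set.
    inliers = max((offsets[np.abs(offsets - c) < d] for c in offsets), key=len)
    # WLS (Eq. (13)): start from the inlier mean, then iterate the weighted average
    # with the exponential weights of Eq. (12).
    t = inliers.mean()
    for _ in range(iters):
        w = np.exp(-beta * np.abs(t - inliers))
        t = float((w * inliers).sum() / w.sum())
    return t
```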
We have empirically found several criteria for deciding how suitable a certain frame I_i(t) is for the offset estimation. First, feature motions should be sufficiently large; the estimation is more difficult and less accurate if the scene is almost static. Second, we prefer feature motions perpendicular to the epipolar line, since the offset t̃ is found by intersecting the feature trajectory with the epipolar line. Third, we may choose frames with more salient features and less occlusion to avoid correspondence errors. In general, we typically choose good short sequences, determine the rough alignment, and then calculate an estimate t̃ for each frame I_i(t). If we can further assume that a sequence is captured at a regular interval without dropped frames, we can take the average of all estimates to obtain the final temporal offset for the entire sequence. The estimated temporal offset can further be used to remove erroneous correspondences, as outlined in Section 4.2.

8 RESULTS

More results can be found in the accompanying video or online at http://vis.uky.edu/~ryang/media/ST-LFR/. We have implemented our ST-LFR framework and tested it with real data. Fig. 13 shows a graphical representation of our entire processing pipeline.

Fig. 13. The space-time light field rendering pipeline.

To facilitate data capture, we built two multicamera systems. The first system uses eight Point Grey Dragonfly cameras [38]: three color ones and five gray-scale ones (see Fig. 1a). The mixed use of different cameras is due to our resource limits. This system can provide a global time stamp for each frame in hardware. The second system includes eight Sony color digital FireWire cameras, but a global time stamp is not available. In both systems, the cameras are approximately 60 mm apart, limited by their form factor. Based on the analysis in [40], the effective depth of field is about 400 mm. The cameras are calibrated, and all images are rectified to remove lens distortion. Since our cameras are arranged in a horizontal linear array, we select only one reference camera for the epipolar constraint, following the discussion in Section 4.3; the closeness measure is simply the distance between camera positions.

We first demonstrate our results on data tagged with a global time stamp, that is, with known temporal offsets. The final images are synthesized using either a planar proxy or a proxy reconstructed from spatial correspondences. Then, we quantitatively study the accuracy of our temporal offset estimation scheme. Finally, we put everything together by showing synthesis results from a data set without a global time stamp, followed by a discussion of our approach.

8.1 View Synthesized with a Planar Proxy

To better demonstrate the advantage of ST-LFR over traditional LFR, we first show some results using a planar proxy for spatial interpolation. The data sets used in this section were captured by the Point Grey camera system with known global time stamps.

Our first data set includes a person waving his hand, as Fig. 1 shows. Since the depth variation from the hands to the head is slightly beyond the focus range (400 mm), we can see some tiny horizontal ghosting effects in the synthesized image, which are entirely different from the vertical mismatches caused by the hand's vertical motion. We will show later that these artifacts can be reduced by using a reconstructed proxy.

We emphasize the importance of virtual synchronization in Fig. 14, in which the image is synthesized with a constant blending weight (1/8) for all eight cameras. The scene contains a moving magazine. Without virtual synchronization (Fig. 14a), the text on the magazine cover is illegible. This problem is rectified after we register feature points and generate virtually synchronized images.

Fig. 14. Synthesized results using a uniform blending weight (that is, pixels from all frames are averaged together). (a) Traditional LFR with unsynchronized frames. (b) ST-LFR.

Our next data set is a moving open book. In Fig. 15, we synthesized several views using traditional LFR and ST-LFR. Results from ST-LFR remain sharp from different viewpoints. The noticeable change of intensity is due to the mixed use of color and gray-scale cameras.

Fig. 15. Synthesized results of the book example. (a) Traditional LFR with unsynchronized frames. (b) ST-LFR.
8.2 View Synthesized with a Reconstructed Proxy

As mentioned in Section 6, building a 3D mesh geometry proxy improves the synthesis results for scenes with a large depth range. This method is adopted in the following two data sets, and the results are compared with the planar proxy. The first data set is the open book, shown in Fig. 16; the back cover of the book is clearly better synthesized with the 3D mesh proxy because of its improved depth information. The second example, shown in Fig. 17, is a human face viewed from two different angles. The face rendered with the planar proxy (a) shows some noticeable distortions in the side views. The comparative results are better illustrated in the accompanying video.

Fig. 16. Top row: synthesized results of the book sequence. Bottom row: the reconstructed proxy. (a) Planar proxy adopted. (b) Three-dimensional reconstructed mesh proxy adopted.

Fig. 17. Synthesized results of a human face from two different angles. (a) Planar proxy adopted. (b) 3D mesh proxy adopted.

8.3 Experiments on Temporal-Offset Estimation

All the above results are obtained from data sets in which the global time stamp for each frame is known. However, since most commodity cameras do not provide such a capability, the temporal offsets of sequences captured by those cameras need to be estimated first. We now evaluate the accuracy of our estimation technique discussed in Section 7.

We first tested with a simulated data set in which feature points move randomly. We add random correspondences (outliers) and perturb the ground-truth feature correspondences (inliers) with a small amount of Gaussian noise. Fig. 18 plots the accuracy of our approach for various outlier ratios. It shows that our technique is extremely robust even when 90 percent of the offsets are outliers.

Fig. 18. Temporal offset estimation error using synthetic tracked features. The ratio of outliers varies from 10 percent to 90 percent. Using three variations of our techniques, the relative errors to the ground truth are plotted. The error is no greater than 15 percent in any test case.

Our second data set is a pair of video sequences with a ground-truth offset, obtained from our Point Grey camera system. Each sequence, captured at 30 fps, contains 11 frames. The temporal offset between the two is 10 ms, that is, 0.3 frame time. Fig. 19 shows the estimated temporal offset; the error is typically within 0.05 frame time. The average epipolar distance for inliers is less than 1 pixel.

Fig. 19. Estimated time offset using real-image sequences. The magenta line shows the exact time offset (0.3 frame time or 10 ms). Each of the test sequences is 11 frames long.

From the above experiments with known ground truth, we can see that our approach produces very accurate and robust estimates of the temporal offset. In addition, combining RANSAC and WLS fitting yields better results than either technique alone.

The second formula in (10) is a linear approximation of p_i(t') using the feature correspondence p_i(t) and p_i(t + Δt). Although object motions in most scenes are nonlinear, our experimental results indicate that the linear approximation is valid for sequences captured at video rate, that is, 30 fps or greater. We did, however, observe that the best estimates were obtained when t̃ satisfied 0 ≤ t̃ ≤ Δt. Therefore, given frame I_i(t), we need to choose a frame I_j(t') that is captured between I_i(t) and I_i(t + Δt). Given a pair of roughly aligned sequences, we search several frames ahead and afterward to find the appropriate frame pairs.

We also compare our method with the approach by Zhou and Tao [23]. Their method, which can also achieve subframe accuracy, uses four corresponding points (p_i(t), p_i(t + Δt), p_j(t'), and p_j(t' + Δt)) to compute a cross ratio that yields the offset t̃. Instead of using all four correspondence points, we use only three, excluding the point p_j(t' + Δt). Experiments show that our method usually performs slightly better, and the advantage of using only three frames is obvious when there are dropped frames.
Our first data set is a pair of synthetic sequences of a dynamic scene generated with OpenGL. The scene consists of a static background and a moving rectangle, and both surfaces are well textured. Images were synthesized at two viewpoints mimicking a standard stereo setup. We render the scene from the two viewpoints with a temporal offset of half a frame, that is, t̃ = 0.5. The estimates are shown in Fig. 20.

Fig. 20. Estimated time offset using synthetic image sequences. The blue line shows the exact time offset (0.5 frame time). Each of the test sequences is 31 frames long.

Our second data set is the same as the first one except that it contains lost frames. As shown in Fig. 21, the sequence from one of the viewpoints lost the sixth frame; thus, the ground-truth time offset jumps from half a frame to one and a half frames. Our method gains an advantage over that in [23] at the fifth and seventh frames.

Fig. 21. Estimated time offset using synthetic image sequences with the sixth and seventh frames lost. The blue line shows the exact time offset (0.5 frame time). Each of the test sequences is nine frames long.

Our last data set is a real pair of rotating magazine sequences with a known offset (shown in Fig. 6). Results are shown in Fig. 22. In this sequence, our estimates are generally better. This example also reveals a setup in which both methods fail: In that frame (frame 22), the movements of the magazine are almost parallel to the epipolar line, a degenerate case for both methods.

Fig. 22. Estimated time offset using real-image sequences. The blue line shows the exact time offset (0.3 frame time or 10 ms). Each of the test sequences is 31 frames long.

8.4 Data with Unknown Temporal Offset

Our last data set is captured with the Sony color camera system without a global time stamp. It contains a swinging toy ball. We started all cameras simultaneously to capture a set of roughly aligned sequences. Then, we use the technique introduced in Section 7 to estimate a temporal offset for each sequence pair, using the sequence from the middle camera as the reference. Since we do not know the ground truth, we can only verify the offset accuracy by looking at the epipolar distances for temporally interpolated feature points. Fig. 23a shows the temporal optical flow seen on a frame I_i(t), and Fig. 23b shows p_i(t')'s epipolar lines created using our estimated time offset on the frame I_j(t'). Note that the spatial correspondences p_j(t') lie right on the epipolar lines, which verifies the correctness of our estimated temporal offset. Using a planar proxy, the synthesized results are shown in Fig. 24. A visualization of the motion trajectory of the ball can be found in the last row of Fig. 1.

Fig. 23. Verifying the estimated time offset using epipolar lines. (a) Temporal optical flow on camera C_i. (b) The epipolar line of p_i(t') (which is linearly interpolated based on the temporal offset) and p_j(t') on camera C_j.

Fig. 24. Synthesized results of the toy ball sequence. Note that the input data set does not contain global time stamps. (a) Traditional LFR with unsynchronized frames. (b) ST-LFR, using estimated temporal offsets among the input sequences.

8.5 Discussion

In comparison with traditional LFR, our formulation requires correspondence information. Finding correspondences is an open problem in computer vision, and it can be fragile in practice. Although we have designed several robust techniques to handle the almost inevitable mismatches and the experimental results have been quite encouraging, we have not yet studied our techniques' stability over long sequences. On the other hand, the biggest problem in matching, textureless regions [41], is circumvented in our formulation since we only compute a sparse flow field that is sufficient to synthesize novel views.
Table 1 lists some of the parameters discussed earlier and the empirical value ranges used in our experiments. Since the epipolar constraint is independent of the data set, the epipolar constraint threshold always falls into the three-pixel to five-pixel range. However, the reference camera selection parameter is sensitive to both the camera array settings and the captured scenes. The effects and choices of the constants in the morphing weight (7) are well explained in [10]. The remaining weighting parameters (for example, the boosting exponent in (8) and the decay rate in (12)), though having different units, are all used to boost exponential weights; we typically adjust them so that the largest weight is 1,000 times greater than the smallest weight value. This works well for most data sets we have tried so far.

TABLE 1. Parameters and Their Typical Values.

Regarding speed, our current implementation is not yet real time. Synthesizing a single frame takes a few minutes on average, depending on the number of detected features. The main bottleneck is the temporal interpolation step; forming the matching edge set also requires a fairly exhaustive search, even after the acceleration structures are adopted.

9 CONCLUSION AND FUTURE WORK

In this paper, we extend traditional LFR to the temporal domain to accommodate dynamic scenes. Instead of capturing the dynamic scene in strict synchronization and treating each image set as an independent static light field, our notion of a space-time light field simply assumes a collection of video sequences. These sequences may or may not be synchronized, and they can have different capture rates. In order to synthesize novel views from any viewpoint at any time instant, we developed a two-stage rendering algorithm: 1) temporal interpolation, which registers successive frames with spatial-temporal optical flow and generates synchronized frames using an edge-guided image-morphing algorithm that preserves important edge features, and 2) spatial interpolation, which uses ULR to create the final view from a given viewpoint.

Experimental results have shown that our approach is robust and capable of maintaining photorealistic results comparable to traditional static LFR even when the dynamic input is not synchronized. Furthermore, by allowing blending over the entire space-time light field, we demonstrated novel visual effects such as the visualization of a motion trajectory in 3D, which cannot be produced with traditional LFR.

We also present a technique to find the temporal alignment between sequences when a global time stamp is not available. Our approach is very accurate and extremely robust: It can estimate a temporal offset with subframe accuracy even when a very large portion (up to 90 percent) of the feature matches are incorrect.
Our technique can be applied in many other areas, such as stereo vision and multiview tracking, to enable the use of unsynchronized video input.

Looking into the future, we plan to investigate a more efficient interpolation method so that we can render a space-time light field in real time and online without significantly sacrificing image quality. This real-time capability, together with the use of inexpensive Web cameras, can contribute to a wider diffusion of light field techniques into many interesting applications, such as 3D video teleconferencing, remote surveillance, and telemedicine.

ACKNOWLEDGMENTS

The authors would like to thank Greg Turk for suggestions and Sifang Li for participation in the experiments. This work was done while the authors were at the University of Kentucky and is supported in part by the University of Kentucky Research Foundation, the US Department of Homeland Security, and US National Science Foundation Grant IIS-0448185.

REFERENCES

[1] M. Levoy and P. Hanrahan, "Light Field Rendering," Proc. Int'l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '96), pp. 31-42, Aug. 1996.
[2] S.J. Gortler, R. Grzeszczuk, R. Szeliski, and M.F. Cohen, "The Lumigraph," Proc. Int'l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '96), pp. 43-54, Aug. 1996.
[3] T. Naemura, J. Tago, and H. Harashima, "Realtime Video-Based Modeling and Rendering of 3D Scenes," IEEE Computer Graphics and Applications, vol. 22, no. 2, pp. 66-73, 2002.
[4] J.C. Yang, M. Everett, C. Buehler, and L. McMillan, "A Real-Time Distributed Light Field Camera," Proc. 13th Eurographics Workshop Rendering, pp. 77-86, 2002.
[5] B. Wilburn, M. Smulski, H. Lee, and M. Horowitz, "The Light Field Video Camera," Media Processors 2002, Proc. SPIE, vol. 4674, pp. 29-36, 2002.
[6] W. Matusik and H. Pfister, "3D TV: A Scalable System for Real-Time Acquisition, Transmission, and Autostereoscopic Display of Dynamic Scenes," Proc. Int'l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '04), pp. 814-824, 2004.
[7] B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy, "High Performance Imaging Using Large Camera Arrays," ACM Trans. Graphics, vol. 24, no. 3, pp. 765-776, 2005.
[8] E. Shechtman, Y. Caspi, and M. Irani, "Increasing Space-Time Resolution in Video," Proc. Seventh European Conf. Computer Vision (ECCV '02), pp. 753-768, 2002.
[9] S. Vedula, S. Baker, and T. Kanade, "Image-Based Spatio-Temporal Modeling and View Interpolation of Dynamic Events," ACM Trans. Graphics, vol. 24, no. 2, pp. 240-261, 2005.
[10] T. Beier and S. Neely, "Feature-Based Image Metamorphosis," Proc. Int'l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '92), pp. 35-42, 1992.
[11] E.H. Adelson and J. Bergen, "The Plenoptic Function and the Elements of Early Vision," Computational Models of Visual Processing, pp. 3-20, Aug. 1991.
[12] H.Y. Shum and L.W. He, "Rendering with Concentric Mosaics," Proc. Int'l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '97), pp. 299-306, 1997.
[13] I. Ihm, S. Park, and R.K. Lee, "Rendering of Spherical Light Fields," Proc. Pacific Graphics, 1998.
[14] C. Zhang and T. Chen, "A Self-Reconfigurable Camera Array," Proc. Eurographics Symp. Rendering, pp. 243-254, 2004.
[15] T. Kanade, P.W. Rander, and P.J. Narayanan, "Virtualized Reality: Constructing Virtual Worlds from Real Scenes," IEEE MultiMedia, vol. 1, no. 1, pp. 34-47, 1997.
McMillan, “Image-Based Visual Hulls,” Proc. Int’l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH ’00), pp. 369-374, Aug. 2000.
[17] R. Yang, G. Welch, and G. Bishop, “Real-Time Consensus-Based Scene Reconstruction Using Commodity Graphics Hardware,” Proc. Pacific Graphics Conf., pp. 225-234, Oct. 2002.
[18] C. Zitnick, S.B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, “High-Quality Video View Interpolation Using a Layered Representation,” ACM Trans. Graphics, vol. 23, no. 3, pp. 600-608, 2004.
[19] Y. Caspi and M. Irani, “A Step Towards Sequence to Sequence Alignment,” Proc. Conf. Computer Vision and Pattern Recognition (CVPR ’00), pp. 682-689, 2000.
[20] Y. Caspi and M. Irani, “Alignment of Non-Overlapping Sequences,” Proc. Int’l Conf. Computer Vision (ICCV ’01), pp. 76-83, 2001.
[21] C. Rao, A. Gritai, M. Shah, and T. Syeda-Mahmood, “View-Invariant Alignment and Matching of Video Sequences,” Proc. Int’l Conf. Computer Vision (ICCV ’03), pp. 939-945, 2003.
[22] P. Sand and S. Teller, “Video Matching,” ACM Trans. Graphics, vol. 23, no. 3, pp. 592-599, 2004.
[23] C. Zhou and H. Tao, “Depth Computation of Dynamic Scenes Using Unsynchronized Video Streams,” Proc. Conf. Computer Vision and Pattern Recognition (CVPR ’03), pp. 351-358, 2003.
[24] S. Borman and R. Stevenson, “Spatial Resolution Enhancement of Low-Resolution Image Sequences—A Comprehensive Review with Directions for Future Research,” technical report, Laboratory for Image and Signal Analysis (LISA), Univ. of Notre Dame, 1998.
[25] J.R. Bergen, P. Anandan, K.J. Hanna, and R. Hingorani, “Hierarchical Model-Based Motion Estimation,” Proc. Second European Conf. Computer Vision (ECCV ’92), pp. 237-252, 1992.
[26] J. Barron, D.J. Fleet, and S.S. Beauchemin, “Performance of Optical Flow Techniques,” Int’l J. Computer Vision, vol. 12, no. 1, pp. 43-77, 1994.
[27] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade, “Three-Dimensional Scene Flow,” Proc. Int’l Conf. Computer Vision (ICCV ’99), vol. 2, pp. 722-729, 1999.
[28] K. Kutulakos and S.M. Seitz, “A Theory of Shape by Space Carving,” Int’l J. Computer Vision (IJCV), vol. 38, no. 3, pp. 199-218, 2000.
[29] H. Wang and R. Yang, “Towards Space-Time Light Field Rendering,” Proc. 2005 Symp. Interactive 3D Graphics and Games (I3D ’05), pp. 125-132, 2005.
[30] C.J. Harris and M. Stephens, “A Combined Corner and Edge Detector,” Proc. Fourth Alvey Vision Conf., pp. 147-151, 1988.
[31] J.-Y. Bouguet, “Pyramidal Implementation of the Lucas-Kanade Feature Tracker: Description of the Algorithm,” technical report, 1999.
[32] B.D. Lucas and T. Kanade, “An Iterative Image Registration Technique with an Application to Stereo Vision,” Proc. Int’l Joint Conf. Artificial Intelligence, pp. 674-679, 1981.
[33] C. Tomasi and T. Kanade, “Detection and Tracking of Point Features,” Technical Report CMU-CS-91-132, Carnegie Mellon Univ., 1991.
[34] L. Zhang, B. Curless, and S. Seitz, “Spacetime Stereo: Shape Recovery for Dynamic Scenes,” Proc. Conf. Computer Vision and Pattern Recognition (CVPR ’03), pp. 367-374, 2003.
[35] J. Canny, “A Computational Approach to Edge Detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679-698, 1986.
[36] J.R.
Shewchuk, “Triangle: Engineering a 2D Quality Mesh Generator and Delaunay Triangulator,” Applied Computational Geometry: Towards Geometric Eng., LNCS 1148, pp. 203-222, 1996.
[37] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen, “Unstructured Lumigraph Rendering,” Proc. Int’l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH ’01), pp. 405-432, Aug. 2001.
[38] Point Grey Research, http://www.ptgrey.com, 2007.
[39] M.A. Fischler and R.C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography,” Comm. ACM, vol. 24, pp. 381-395, 1981.
[40] J.-X. Chai, X. Tong, S.-C. Chan, and H.-Y. Shum, “Plenoptic Sampling,” Proc. Int’l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH ’00), pp. 307-318, 2000.
[41] S. Baker, T. Sim, and T. Kanade, “When Is the Shape of a Scene Unique Given Its Light-Field: A Fundamental Theorem of 3D Vision,” IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 25, no. 1, pp. 100-109, 2003.

Huamin Wang received the BS degree from Zhejiang University, China, in 2002 and the MS degree in computer science from Stanford University in 2004. He is now a PhD candidate in computer science at the Georgia Institute of Technology, working under the direction of Professor Greg Turk. His current work is focused on simulating small-scale fluid effects, texture synthesis, and light field rendering. He is a recipient of an NVIDIA fellowship in 2006-2007 and a student member of the IEEE Computer Society and the ACM.

Mingxuan Sun received the BS degree from Zhejiang University, China, in 2004 and the MS degree in computer science from the University of Kentucky in 2006. She is now a PhD student in computer science at the Georgia Institute of Technology. Her current work is focused on shading correction and light field rendering. Prior to Georgia Tech, she also worked in the Center for Visualization and Virtual Environments at the University of Kentucky. She is a student member of the IEEE.

Ruigang Yang received the MS degree in computer science from Columbia University in 1998 and the PhD degree in computer science from the University of North Carolina at Chapel Hill in 2003. He is an assistant professor in the Computer Science Department at the University of Kentucky. His research interests include computer graphics, computer vision, and multimedia. He is a recipient of the US NSF CAREER award in 2004 and a member of the IEEE and the ACM.