A System for Real Time 3D Reconstruction

by

Daniel P. Snow

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science at the Massachusetts Institute of Technology, May 2000.

© Massachusetts Institute of Technology 2000. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 19, 2000
Certified by: Paul Viola, Associate Professor, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

A System for Real Time 3D Reconstruction
by Daniel P. Snow

Submitted to the Department of Electrical Engineering and Computer Science on May 19, 2000, in partial fulfillment of the requirements for the degree of Master of Science.

Abstract

Real-time 3D model generation is an exciting new area of computer vision research made possible by increased processor power and network speed, and by the falling cost and growing availability of imaging hardware. In the near future, hardware specialists promise that cameras can be made very cheaply. Multi-camera systems will become ubiquitous and will change how we view streaming media. A complete system that generates real-time 3D models presents many challenges. This thesis focuses on calibration, silhouette extraction and model computation. It also touches upon data transfer and display. A set of solutions to these subproblems is presented, as well as results from a real-time system.

Thesis Supervisor: Paul Viola
Title: Associate Professor

Acknowledgments

I would like to thank the following people, in reverse alphabetical order, for support, encouragement, good ideas and helping make this system happen: Ramin Zabih, Marta Wudlick, Paul Viola, Kinh Tieu, Chris Stauffer, Mike Ross, Owen Ozier, Erik Miller, Nick Matsakis, Victor Lum, John Fisher and Pedro Felzenszwalb.

Contents

1 System Overview
  1.1 Introduction
  1.2 Silhouette Intersection
  1.3 Overview
    1.3.1 Hardware
    1.3.2 Software
2 Calibration
  2.1 Introduction
  2.2 Calibration through Reprojection
    2.2.1 Introduction
    2.2.2 Calibration Estimation
    2.2.3 Volume Intersection and Reprojection
    2.2.4 Experiments
    2.2.5 Discussion
3 Silhouette Construction
  3.1 Introduction
    3.1.1 Preliminaries
  3.2 Silhouettes through Graph Cuts
    3.2.1 Introduction
    3.2.2 Energy Minimization and Graph Cuts
    3.2.3 Implementation
    3.2.4 Results
    3.2.5 Discussion
  3.3 Technique Comparison
4 Reconstruction without Silhouettes
  4.1 Introduction
  4.2 Problem Formulation
    4.2.1 Terminology
    4.2.2 The Energy Function
    4.2.3 Energy Minimization vs. Silhouette Intersection
    4.2.4 Graph Cuts
  4.3 Experiments
    4.3.1 Graph Specification
    4.3.2 Performance
    4.3.3 Synthetic Experiment
    4.3.4 Real Experiments
  4.4 Conclusion
5 Running the System
  5.1 Discussion
  5.2 Results
A Data Transfer and Fast Reconstruction
  A.0.1 Introduction
  A.0.2 Compression, Worldlines and Imagelines
  A.0.3 Discussion
B Related Work
C Radial Distortion

List of Figures

1-1a  8 of 16 images taken at a particular time instant to construct a 3D model.
1-1b  3 views from the 3D reconstruction.
1-2   3 views from the 3D reconstruction at the next time instant.
1-3   Left: A volume constructed from one silhouette and one camera. Right: The intersection of four volumes, two of which are shown to scale.
1-4   Volume carving example. Top left: the volume is filled. Top middle: one camera's effect on the volume. Top right: 2 cameras. Bottom left: 3 cameras. Bottom middle: 8 cameras. Bottom right: sixteen cameras used to carve the volume.
1-5   Top left: Volume constructed with two cameras which are opposite each other. Top right: Volume constructed with two cameras which are at a 90 degree angle from each other. Bottom: Volume constructed with three cameras, two opposite and one 90 degrees from both of them.
1-6   Left: Box with concavity. Right: Box with rectangular hole.
1-7   A 3D volume acquisition area set up in the lab environment. The acquisition volume is roughly 2 meters square and is surrounded by 16 cameras. The cameras which are visible in this image are highlighted with red circles. A figure in the center of the Capture Cube is in motion while the cameras are taking synchronous pictures.
2-1   A camera which has been rotated 17 degrees from its "true" position. The cone originating from the uncalibrated camera (in another color) is not coincident with the object at all. For comparison, the cone from the correctly calibrated camera is also shown.
2-2   Illustration of the projective geometry of a camera in 2D.
2-3   A three dimensional camera scene.
2-4   Camera parameters of the rightmost camera are adjusted while the pixel used to construct the ray remains fixed for comparison with the "true" calibration. Top left: The camera is translated 5 inches along each of the local (X, Y, Z) axes. Top right: The vertical field of view is increased from 116 degrees to 126 degrees. Bottom left: The camera is rolled 24 degrees around the z axis. Bottom right: The camera is rotated 2.4 degrees to the right.
2-5   Left four images: Projection of a 3D model onto four silhouettes for rough calibration estimates. The 3D model projection is in white and the silhouette is in light gray. Right four images: Projection of the 3D model after the refinement algorithm is run. The 3D models project back almost exactly onto the silhouettes.
2-6   Left: the reconstructed "visual hull" for the original set of cameras. Middle: Reconstruction after the cameras have been randomly changed. Right: Reconstruction after 500 iterations of the "Volume Reprojection" refinement algorithm.
2-7   Left: Original image. Middle: Reconstruction before refinement. Right: Reconstruction after refinement.
3-1   The figure on the left is the difference image without the correction to the gain. The figure on the right shows the result after automatic gain control correction.
3-2   A one dimensional max flow/minimum cut example. (i) An example of a 1-dimensional graph with edge weights and flow direction noted. (ii) The flow through each edge. (iii) The vertex/pixel labeling which corresponds to the min-cut. The figure also shows the edges which are cut. The cut separates the graph source and sink, so that there is no path between the two.
3-3   (i) The function describing err(H), err(S) and err(V). (ii) The function describing err_shad.
3-4   Upper-left: Original foreground image. Lower-left: Image describing error from the background (k1 = 5, k2 = 5, k3 = 20). Upper-right: "Morphological" result. Lower-right: Graph cut result (C1 = foreground weight = 8, C2 = edge weight = 12). The graph cut method does a good job filling in the holes and preserving most of the foreground boundaries.
3-5   Background images for the following figures.
3-6   Upper-left: Original foreground. Lower-left: Image describing error from the background (k1 = 5, k2 = 5, k3 = 20). Upper-right: Image describing error for non-shadow regions (k1 = 5, k2 = 5, k3 = 20, k4 = shadow term weight = 1.15). Lower-right: Graph cut result (C1 = foreground weight = 10, C2 = edge weight = 24). I think this result is particularly exciting because it finds bounded regions for both foreground and shadows.
3-7   This set of figures has the same original as the set in figure 3-6. Each of the four images has a different edge weight and foreground weight. Upper-left: foreground weight = 10, edge weight = 0. Lower-left: foreground weight = 18, edge weight = 10. Upper-right: foreground weight = 10, edge weight = 50. Lower-right: foreground weight = 18, edge weight = 50.
3-8   Graph cut method on another image. Upper-left: Original foreground. Lower-left: Image describing error from the background (k1 = 5, k2 = 5, k3 = 20). Upper-right: Image describing error for non-shadow regions (k1 = 5, k2 = 5, k3 = 20, k4 = shadow term weight = 1.15). Lower-right: Graph cut result (C1 = foreground weight = 10, C2 = edge weight = 24).
3-9   Top: Six of the sixteen views which were used to generate the following 3D models. Middle: Silhouette extraction using the graph cut method. Bottom: Silhouette extraction using the graph cut method with shadow removal.
3-10  Top: Two views of the model generated from the silhouettes where shadows were taken into account. Bottom: The same views as above, but without shadows removed from the silhouettes.
3-11  Comparing the graph cut method with the "morphological" method. Upper-left: Hand segmented foreground. Lower-left: Hand segmentation overlayed on the graph cut result. Upper-right: "Morphological" result. Lower-right: Hand segmentation overlayed on the "morphological" result.
4-1   Top left: Points on an image. Rays constructed using the camera center and the points on the image. Bottom: A set of rays projected into space at the boundary of the silhouette.
4-2   Ground truth voxel reconstruction of a synthetic cylinder (left), silhouette intersection reconstruction (right), and our reconstruction (bottom).
4-3   Left: Eight of the 16 images captured within the Capture Cube. Right: Silhouettes computed from these images.
4-4   Three reconstructions from the images shown in Figure 4-3. Top left: reconstruction using our method. Top right: reconstruction using silhouette intersection (silhouettes were computed using image differencing and morphological operations to remove noise). Bottom: robust silhouette intersection, where a voxel is considered occupied if 3 out of 4 cameras agree that it is within the silhouette.
4-5   Top: four of the 16 views captured. Bottom left: one view of our reconstructed volume. Bottom right: another view of the reconstructed volume.
5-1   The reconstructed virtual world with cameras and image planes. Note that the figure is not in the center of the reconstructed area.
5-2   Several views of a reconstruction. Cones originating from two of the cameras are shown to demonstrate volume intersection.
5-3   Original images used for the silhouettes in 5-4.
5-4   The 12 silhouettes used to reconstruct the model in 5-5.
5-5   Six views of a reconstructed model.
5-6   Silhouettes from these 12 images were used to reconstruct the model in 5-7.
5-7   Several views of a 3D reconstruction.
5-8   Our first 3D reconstruction from two views. Only six cameras were used to generate this model. Notice, in the image on the right, that there are not enough cameras to carve away part of the volume in front of the torso region.
5-9   Images from a 3D "frisbee". One of the frames captures the frisbee in mid air.
5-10  Images from a 3D movie of a person lifting a box (Part I).
5-11  Images from a 3D movie of a person lifting a box (Part II).
5-12  Images from a 3D movie of a person lifting a box (Part III).
5-13  Images from a 3D movie of a figure dancing.
A-1   A well segmented silhouette with a line overlayed indicating the direction of run length encoding.
A-2   (a) A set of worldlines mapping onto a virtual image plane; (b) a close-up of this camera/worldline system, where the rightmost imageline matches up with one of the worldlines.
A-3   Top: Three camera views, a reconstructed figure and a worldline.
A-4   Three runs corresponding to the left, middle and right camera views from figure A-3. The run on the right is the intersection of those three runs.
C-1   Two images used to calculate the radial distortion correction. Notice the clear bowing effect in both images due to radial distortion.

Chapter 1
System Overview

1.1 Introduction

This thesis presents a system which is able to generate 3D models in real time within a multiple camera environment. Given the observations from 16 CCD (charge-coupled device) cameras, it is possible to create an accurate computer model of the shape of objects within the target space. For example, using the raw data seen in figure 1-1a, the test system was able to build the model seen in figure 1-1b.

Figure 1-1a: 8 of 16 images taken at a particular time instant to construct a 3D model.

Figure 1-1b: 3 views from the 3D reconstruction.

Figure 1-2 displays the model generated 30 ms later with the next set of images.

Figure 1-2: 3 views from the 3D reconstruction at the next time instant.

The prototype system uses sixteen cameras, each producing images of 320x240 pixels, connected to a 100 megabit network to collect the raw data. The data is processed at a central computer which builds the 3D model.

The system requires that a highly accurate calibration of the cameras be performed. Chapter 2 discusses our approach, including a new method for refining a rough calibration estimate. The 3D reconstruction algorithm is based on the silhouette intersection algorithm discussed in section 1.2. The critical first step is the construction of accurate silhouettes. Chapter 3 discusses a fast implementation which resolves many of the errors that make this a difficult problem. A new approach for creating 3D models is presented in chapter 4. Although related to the silhouette intersection algorithm, this new approach does not require silhouettes for computing a 3D model and can potentially be faster and generate better results. For completeness, prior work on a novel approach to reconstruction, which allows the system to run in real time, is discussed in appendix A. Chapter 5 shows results from runs of the complete system.

Real-time 3D model generation is an exciting new area of computer vision research made possible by increased processor power and network speed, and by the falling cost and growing availability of imaging hardware. In the near future, hardware specialists promise that cameras can be made very cheaply. Multi-camera systems will become ubiquitous and will change how we view streaming media. A complete system that generates real-time 3D models presents many challenges. This thesis focuses on calibration, silhouette extraction and model computation. It also touches upon data transfer and display. A set of solutions to these subproblems is presented, as well as results from a real-time system.
There exists great potential for real-time 3D reconstruction in modern media. With a 3D model of a particular scene, it is straightforward to generate an arbitrary view of that model. Using the raw image data, it is also possible to place a texture on the 3D geometry, creating a very realistic scene. In one application, a user could maneuver through a realistic 3D world in which they otherwise could not be present, such as a sporting event, a lethal radioactive environment, or the inside of a running motor. A system similar to that described in this thesis would enable this to happen. The ability to render a view from a real environment can provide a user with interactive control of the view as the environment changes in real time.

Another application is interactive 3D TV. The current approach for bringing video to the masses requires a professional director to dictate what view is seen and for what duration. Streaming a set of 3D scenes to a television set has the potential to move that choice from the television station to the home, providing the viewer with an optimal experience.

1.2 Silhouette Intersection

The algorithms developed in this thesis draw their inspiration from the silhouette intersection algorithm (Laurentini, 1994; Szeliski, 1993). Because of its importance and relevance to our system, I want to describe the algorithm's merits and shortcomings before delving into the details of our system.

Creating a 3D model using the silhouette intersection approach requires that the foreground object be segmented out as a silhouette. The volume is then determined by the silhouette boundaries from each of the contributing images. This approach has a basic limitation, however, because the volume computed from silhouette data will rarely correspond exactly to the "true" volume. With an infinite number of cameras, the volume converges to what is termed the "visual hull."

I will discuss the two common schemes for computing the volume of an object. Although these methods differ from the algorithm we use, they will help us gain intuition into calibration and into how our method of model creation effectively employs the silhouette intersection approach.

In the first method, each voxel [1] is projected back onto each of the silhouettes. A parameterized description of a camera, discussed in chapter 2, is used to calculate the projection from a voxel in 3D to a pixel in the 2D image. If the voxel projects into the silhouette in every image, then the voxel is marked as part of the volume. This method depends on the resolution of the voxelated space. For example, with this approach, a voxelated space whose dimensions are 10x10x10 will result in a very coarse representation of a 3D model. (A sketch of this scheme appears below.)

[1] A voxel is a discrete unit of volume, usually equivalent to the shape of a cube.

The second method consists of intersecting the volumes of cones. Each cone is determined by its base, the silhouette on the image plane, and its vertex, the origin of that camera. Although this method avoids voxel resolution problems, in practice it is difficult and computationally expensive to find the intersection of the volumes analytically. The result of analytic intersection also depends on how the volume will be represented and how a discrete set of pixels will define a silhouette. Figure 1-3 illustrates the basic volume intersection algorithm.

Figure 1-3: Left: A volume constructed from one silhouette and one camera. Right: The intersection of four volumes, two of which are shown to scale.
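The voxel projection scheme is straightforward to implement. The sketch below is a minimal illustration of it, not the code used in this system; the camera and silhouette interfaces are assumed stand-ins for the calibrated cameras of chapter 2 and the silhouettes of chapter 3.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical interfaces; names and signatures are illustrative only.
struct Camera {
    // Projects a world point to a pixel; returns false if the point is behind
    // the camera or falls outside the image.
    bool project(double x, double y, double z, int& u, int& v) const;
};
struct Silhouette {
    bool inside(int u, int v) const;   // true if pixel (u, v) is foreground
};

// First scheme from section 1.2: a voxel is kept only if it projects into the
// silhouette of every view. Voxel centers are measured from the volume origin.
std::vector<bool> carveVolume(const std::vector<Camera>& cams,
                              const std::vector<Silhouette>& sils,
                              int nx, int ny, int nz, double voxelSize)
{
    std::vector<bool> occupied(static_cast<std::size_t>(nx) * ny * nz, true);
    for (int k = 0; k < nz; ++k)
        for (int j = 0; j < ny; ++j)
            for (int i = 0; i < nx; ++i) {
                const double x = (i + 0.5) * voxelSize;
                const double y = (j + 0.5) * voxelSize;
                const double z = (k + 0.5) * voxelSize;
                const std::size_t idx =
                    (static_cast<std::size_t>(k) * ny + j) * nx + i;
                for (std::size_t c = 0; c < cams.size(); ++c) {
                    int u, v;
                    // A voxel outside any silhouette (or any view) is removed.
                    if (!cams[c].project(x, y, z, u, v) || !sils[c].inside(u, v)) {
                        occupied[idx] = false;
                        break;
                    }
                }
            }
    return occupied;
}
```

The early break reflects the carving intuition discussed below: a single disagreeing view is enough to remove a voxel.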
Although the cone extends infinitely away from the center of the camera, it is sectioned in figure 1-3 for illustrative purposes.

One can think of the silhouette intersection algorithm as a volume carving procedure. The volume is initialized to be completely filled. Each camera's silhouette then cuts away the portion of the volume that it does not need. Figure 1-4 shows how the addition of each camera carves away more of the volume.

Figure 1-4: Volume carving example. Top left: Volume is filled. Top middle: One camera's effect on the volume. Top right: 2 cameras. Bottom left: 3 cameras. Bottom middle: 8 cameras. Bottom right: sixteen cameras used to carve the volume.

With this in mind, and with a limited number of views, we can use heuristics to place the cameras strategically so that more of the volume can be carved away than with a purely random choice. For example, if two cameras are placed opposite each other, with a subject in between them, then both cameras will cut away the same volume (except for perspective effects). In this case, nothing is gained by using the second camera. Figure 1-5 helps visualize this scenario. The volume at the top left of the figure was constructed using silhouettes from two cameras which are opposite each other. Because they effectively cut away the same volume, the shape does not appear to be a discrete box, but rather an object that extends indefinitely on either end. (The volume has been cropped to fit in the figure, but effectively extends right and left to the camera centers.) The volume on the top right in figure 1-5 was constructed using silhouettes from cameras which are 90 degrees apart. The box shape begins to become distinguishable in this volume. The volume at the bottom of figure 1-5 was constructed with three cameras, the union of the cameras used in the volumes above it. If all three cameras are on the same plane, then the addition of a third camera which is opposite one of the others does improve the shape. This is due to perspective. If the cameras opposite each other were spaced far enough apart that they could be approximated by orthographic projective cameras, the third camera would not cut away any more of the volume.

Figure 1-5: Top left: Volume constructed with two cameras which are opposite each other. Top right: Volume constructed with two cameras which are at a 90 degree angle from each other. Bottom: Volume constructed with three cameras, two opposite, and one 90 degrees from both of them.

Figure 1-6 demonstrates a potential drawback of the silhouette intersection algorithm. The box on the left side of the figure contains a concavity. There is no angle or position at which a camera can be placed so that its silhouette will notice this concavity. Because the volumes are constructed using silhouettes, the concavity will never be accounted for. The box on the right side of the figure has a hole from one side of the box to the other. There will be a number of cameras whose silhouettes will pick up this hole. In this case, the volume generated with the silhouette intersection algorithm will have a hole through the box.

Figure 1-6: Left: Box with concavity. Right: Box with rectangular hole.

1.3 Overview

There are two incarnations of our system. In both, camera setup and calibration are necessary preliminaries. In addition, for each camera there is a child process with an associated frame grabber board. A parent process manages all of the child processes.
For example, it synchronizes the time at which a frame is captured for each of the child processes.

In system 1, each child process grabs a frame and constructs a silhouette. It then compresses the silhouette and passes the data to the parent process. The parent process merges the data into a 3D model and displays it. In system 2, each child process grabs a frame and generates a difference image. The difference consists of subtracting the current frame from an image that was taken before the object was present. The difference images are sent to the parent process, which merges them using the method discussed in chapter 4. The models are then displayed. This method does not currently run in real time, but real-time performance is expected after further investigation.

In the following section, I describe the physical setup to give the reader a better feeling for how the data for the reconstructions is obtained. I then give, in section 1.3.2, a brief overview of the high-level software components which tie the hardware pieces together.

1.3.1 Hardware

We use 8 computers, each equipped with 2 frame grabber boards. The number of cameras is not limited by the algorithms chosen, but rather by availability. The computers are connected over 100 megabit ethernet. We have not added any special lighting or a scene backdrop to make processing easier.

Sixteen cameras are mounted around a volume approximately 3 meters wide, 3 meters deep and 2 meters high. The cameras are not constrained to be in any particular location, but are positioned so as to sample the space as uniformly as possible. More specifically, our volume is defined by a physical structure whose dimensions are given above. Four cameras are mounted on the top edges of the structure, four are placed in the middle of the corner posts, and eight cameras are positioned on tripods evenly distributed on each of the sides. All cameras point approximately toward a central, person-sized area in the middle of the volume. As the discussion in section 1.2 concluded, more geometry can be recovered if the cameras on opposite sides do not mirror each other. This has been taken into account when placing our cameras. We call our space the Capture Cube.

Figure 1-7: A 3D volume acquisition area set up in the lab environment. The acquisition volume is roughly 2 meters square, and is surrounded by 16 cameras. The cameras which are visible in this image are highlighted with red circles. A figure in the center of the Capture Cube is in motion while the cameras are taking synchronous pictures.

1.3.2 Software

We use MPI (Message Passing Interface) as the communication interface for transferring data over the network. The sixteen cameras simultaneously capture images at a resolution of 320x240 pixels. The resolution of the images has been reduced from the original size of 640x480 pixels to enable faster computation times. The images are then sent to a central computer, which computes the 3D model from the data using a new silhouette intersection algorithm, described in appendix A, which tightly integrates compression and speed into the reconstruction. We also discuss another technique in chapter 4 which computes a 3D volume without first needing to compute the silhouettes.
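As a rough illustration of the parent/child organization and the MPI communication described above, the sketch below has rank 0 act as the parent: it broadcasts a frame index as the synchronization point and gathers a fixed-size buffer from every child. The buffer size and the helpers grabFrame and computeSilhouette are assumptions made for the illustration; this is not the Capture Cube code itself.

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the frame grabber and silhouette code; the real
// system reads from a frame grabber board and run-length encodes the result.
std::vector<unsigned char> grabFrame(int) {
    return std::vector<unsigned char>(320 * 240, 0);
}
std::vector<unsigned char> computeSilhouette(const std::vector<unsigned char>& img) {
    return img;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int kMaxBytes = 320 * 240;   // assumed upper bound per camera
    for (int frame = 0; frame < 100; ++frame) {
        // The parent (rank 0) announces the frame index; this broadcast is the
        // synchronization point for all child processes.
        int current = frame;
        MPI_Bcast(&current, 1, MPI_INT, 0, MPI_COMM_WORLD);

        std::vector<unsigned char> payload(kMaxBytes, 0);
        if (rank != 0) {
            std::vector<unsigned char> sil = computeSilhouette(grabFrame(current));
            std::copy(sil.begin(), sil.end(), payload.begin());
        }

        // The parent gathers one fixed-size buffer per process (its own slot is
        // unused) and would then merge the silhouettes into a model and display it.
        std::vector<unsigned char> all;
        if (rank == 0) all.resize(static_cast<std::size_t>(kMaxBytes) * size);
        MPI_Gather(payload.data(), kMaxBytes, MPI_UNSIGNED_CHAR,
                   rank == 0 ? all.data() : nullptr, kMaxBytes, MPI_UNSIGNED_CHAR,
                   0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```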
Chapter 2
Calibration

2.1 Introduction

The goal of calibration is to geometrically coordinate images taken from many cameras. Calibration allows us to correctly locate the position and boundary of the volumes that correspond to each of the cameras. This prepares the system to run the volume intersection algorithm, which finds the intersection of spatially coincident volumes. Figure 2-1 shows the importance of calibration in terms of volume intersection. When a camera is not calibrated, the cone originating from that camera might not have any intersection with the object volume.

Figure 2-1: A camera which has been rotated 17 degrees from its "true" position. The cone originating from the uncalibrated camera (in another color) is not coincident with the object at all. For comparison, the cone from the correctly calibrated camera is also shown.

For simplicity we model each camera as a pinhole camera. This model provides enough accuracy given the quality and resolution of the cameras. For example, our silhouette computation potentially has errors of 1-2 pixels around the boundary of the silhouette. This noise offsets improvements that could be made with a more complex camera model. We also save computational resources by not using a camera model with additional parameters. The camera model can be described by seven parameters: three for location, three for rotation and one for focal length. The seven parameters of each camera must be found with respect to a common coordinate system. This problem is easy to describe but difficult to solve, due to its non-linear nature. Section 2.2 discusses our method for calibrating the cameras. It emphasizes our new approach for refining an approximate camera calibration.

A projective camera is typically constructed in a local coordinate system whose origin is at the camera center and with the z axis pointing towards the image plane. The focal length and the size of the image determine the field of view. Figure 2-2 gives a two dimensional example. Using this figure as a reference, we can find the projection of the 2D point $(X_1, Z_1)$ onto the one dimensional image plane, $x_1 = f X_1 / Z_1$, using similar triangles. The extension to three dimensions is straightforward: if we designate $\bar{x}$ as the position on the image plane and $\bar{X} = (X, Y, Z)$ as the position in space, then $\bar{x} = f \bar{X} / Z$, i.e. $x = f X / Z$ and $y = f Y / Z$.

Figure 2-2: Illustration of the projective geometry of a camera in 2D.

Figure 2-3 illustrates the geometry of a camera system. The center of the camera is marked with a sphere. The local z axis extends from the camera center to the center of the image plane, which has been inscribed with the outline of the silhouette. Lines formed by connecting the center of the camera with the corners of the image plane represent the field of view. The global (X, Y, Z) axis is located at the bottom right of the figure. The location of the center of the camera, along with the direction of the local z axis, represents the camera's six external coordinates with respect to the global coordinate system. The seventh camera parameter determines the field of view.

Figure 2-3: A three dimensional camera scene.

The camera model determines how rays project from the camera into 3 dimensional space, or conversely how voxels in space project back to the image plane. The projection operation is integral to the silhouette intersection algorithm, and this leads to the necessity of accurate calibration.
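To make the seven-parameter pinhole model concrete, the sketch below projects a world point into an image for a camera described by its position, rotation and focal length. The rotation-matrix storage and the image-coordinate conventions are assumptions made for the illustration, not a description of the system's actual code.

```cpp
// Seven-parameter pinhole camera: three position values, a rotation (stored
// here as a 3x3 matrix built elsewhere from three angles) and a focal length
// expressed in pixel units.
struct PinholeCamera {
    double cx, cy, cz;   // camera center in world coordinates
    double R[3][3];      // world-to-camera rotation
    double f;            // focal length in pixels
    int width, height;   // image size
};

// Projects the world point (X, Y, Z) into the image; returns false when the
// point is behind the camera or outside the image bounds.
bool project(const PinholeCamera& cam, double X, double Y, double Z,
             double& u, double& v)
{
    // Express the point in the camera's local frame.
    const double dx = X - cam.cx, dy = Y - cam.cy, dz = Z - cam.cz;
    const double xc = cam.R[0][0] * dx + cam.R[0][1] * dy + cam.R[0][2] * dz;
    const double yc = cam.R[1][0] * dx + cam.R[1][1] * dy + cam.R[1][2] * dz;
    const double zc = cam.R[2][0] * dx + cam.R[2][1] * dy + cam.R[2][2] * dz;
    if (zc <= 0.0) return false;                 // behind the camera

    // Pinhole projection x = f X / Z, shifted so the origin is the image corner.
    u = cam.f * xc / zc + 0.5 * cam.width;
    v = cam.f * yc / zc + 0.5 * cam.height;
    return u >= 0.0 && u < cam.width && v >= 0.0 && v < cam.height;
}
```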
Intuitively, figure 2-4 shows how a projected ray will shift depending on the values of the camera parameters. As the camera parameters are adjusted, the perceived location of the object associated with a pixel also shifts. A point on the surface of the 3D model is chosen in figure 2-4 and projected back to the image planes of each of the cameras. The camera center and the projected pixel location for each of the four cameras determine a ray. When the cameras are correctly calibrated, the rays constructed from the projected pixel locations of each of the cameras meet at the original 3D location whose projection determined the ray construction. The rightmost camera in each of the sub-figures of 2-4 is moved from its "true" calibration by adjusting each of the camera parameters in turn. To illustrate the effect of the miscalibration, the ray is still constructed using the same pixel.

Figure 2-4: Camera parameters of the rightmost camera are adjusted while the pixel used to construct the ray remains fixed for comparison with the "true" calibration. Top left: The camera is translated 5 inches along each of the local (X, Y, Z) axes. Top right: The vertical field of view is increased from 116 degrees to 126 degrees. Bottom left: The camera is rolled 24 degrees around the z axis. Bottom right: The camera is rotated 2.4 degrees to the right.

Camera calibration has received a lot of attention due to its critical role in geometry-related vision algorithms. See (Stein, 1993) and (Tsai, 1987) for reviews of calibration algorithms. Many algorithms require only rough calibration, for which there are many solutions. 3D model reconstruction requires exact calibration. Several solutions have been proposed, including solving for the camera parameters by corresponding points with known 3D locations. Often a checkerboard pattern is used because it is easy to extrapolate 3D point locations given its regular structure. Problems occur because the camera model is inherently nonlinear and therefore difficult to solve for directly. Attempts have been made to make the problem linear (Faugeras, 1993) and to solve for groups of parameters individually. These parameters are typically divided into two groups: external and internal. The external parameters correspond to camera position and rotational orientation with respect to a global coordinate system. The internal parameters consist of all the parameters which are intrinsic to the camera, such as focal length, scale (e.g. whether the pixels are square, rectangular or skewed) and radial distortion. Stein (Stein, 1993) discusses a method for internal calibration using the properties of geometric objects and camera rotation to solve for some of the internal parameters. Zhang (Zhang, 1998) uses a set of checkerboard patterns defined on different planes to solve for both internal and external parameters. All these methods can still suffer from inaccuracies and time-consuming manual intervention.

2.2 Calibration through Reprojection

2.2.1 Introduction

Our cameras are calibrated using a two step process. First a rough calibration is estimated using fiducial correspondences. [1] To improve on the rough calibration, we have developed a novel algorithm, which does not depend on fiducials, to refine the calibration estimate. Our approach requires only the availability of a silhouette from each of the cameras. Our algorithm has an additional advantage in that it can dynamically update the calibration once the system is running. Although still in the initial stages of investigation, the results from this new algorithm look very promising.

[1] We also account for radial distortion, as described in appendix C.
2.2.2 Calibration Estimation

Fiducials are located either by manually clicking the mouse on their 2D locations in images, or by using computer vision techniques to locate them automatically. We have tried several types of fiducials and have discovered that each has its own problems. First we laser printed colored circles, which could be uniquely identified by their combinations of colors. For example, one fiducial had a yellow circle within a red circle. Unfortunately, our CCD cameras do not sense a wide range of colors. More importantly, the cameras came with the supposed feature of "automatic white balance" that could not be turned off. This feature would change the red, green or blue components of a pixel based on an unknown function of all the pixels in the image. Because the colors of an object depended on current image conditions, a simple color thresholding algorithm had a hard time locating all the fiducials. Also, the centers of the fiducials we did find were difficult to locate accurately. We also tried using colored pieces of wood whose dimensions in inches were 1x2x36. The idea behind this method was that because the sticks had greater coverage, finding the lines they traced out in the image would be less susceptible to noise. However, their wider coverage actually became a disadvantage. It was difficult to separate the lines that corresponded to each stick because some were hidden or obscured by other sticks in front of them.

From our initial experiments, we found the need for a set of robust fiducials. To this end, we have finally developed a satisfactory method for locating fiducials. We move a set of lights to known 3D locations. Because lights are easy to identify in images, we are able to get accurate correspondences. The 3D locations of the fiducials are measured ahead of time. After we have found the fiducials in each camera's image, we have a correspondence between the 3D points and the 2D points they project to on the image plane.

With this information in hand, we run an iterative algorithm to determine the camera parameters. The algorithm is run for each camera separately. We initialize the algorithm with a guess for the camera parameters. Using this guess, we project the 3D points onto the image. We have devised an error function based on our knowledge of where the points are supposed to project. We want to find a set of parameters for each camera which minimizes the following summation:

$$\min \sum_{n=1}^{N} \left[ (x_n^p - x_n^m)^2 + (y_n^p - y_n^m)^2 \right] \qquad (2.1)$$

There are N 3D points. 3D points projected onto the image plane are labeled with a superscript p, and the points they correspond to on the image are labeled with a superscript m. We adjust each of the parameters randomly and project the 3D points back down to the image plane. If our minimization criterion improves, we keep the new set of parameters, and we continue to iterate until the summation reaches a minimum threshold or no longer improves. On average, there will be a 1-2 pixel error between the projected point and the identified fiducial location in the image.

There are two errors involved in the fiducial calibration process. First, the 3D locations of the fiducials are not known exactly, and second, our estimates of the locations of the fiducials in the 2D image plane are prone to error. Given these errors, only a rough estimate of the camera calibration can be made.
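The per-camera search described above can be sketched as a simple randomized descent on the reprojection error of equation (2.1). The camera parameterization and projection routine are assumed interfaces in the spirit of section 2.1, and the perturbation scale and iteration count are placeholders rather than the values used in the real system.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

struct Point3 { double X, Y, Z; };
struct Point2 { double x, y; };

// Seven camera parameters packed into one array (three for position, three
// for rotation, one for focal length) and a projection routine like the one
// sketched in section 2.1; both are illustrative assumptions.
struct CameraParams { double p[7]; };
typedef Point2 (*ProjectFn)(const CameraParams&, const Point3&);

// Reprojection error of equation (2.1) for a single camera.
double reprojectionError(const CameraParams& cam, ProjectFn project,
                         const std::vector<Point3>& world,
                         const std::vector<Point2>& image)
{
    double err = 0.0;
    for (std::size_t n = 0; n < world.size(); ++n) {
        const Point2 proj = project(cam, world[n]);
        err += (proj.x - image[n].x) * (proj.x - image[n].x)
             + (proj.y - image[n].y) * (proj.y - image[n].y);
    }
    return err;
}

// Randomized refinement: perturb all seven parameters and keep the change
// only when the reprojection error improves.
CameraParams refineCamera(CameraParams cam, ProjectFn project,
                          const std::vector<Point3>& world,
                          const std::vector<Point2>& image,
                          int iterations, double scale)
{
    double best = reprojectionError(cam, project, world, image);
    for (int it = 0; it < iterations; ++it) {
        CameraParams trial = cam;
        for (int i = 0; i < 7; ++i)
            trial.p[i] += scale * (2.0 * std::rand() / RAND_MAX - 1.0);
        const double e = reprojectionError(trial, project, world, image);
        if (e < best) { best = e; cam = trial; }
    }
    return cam;
}
```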
2.2.3 Volume Intersection and Reprojection

The "Volume Reprojection" approach to calibration allows us to refine a rough calibration estimate. The inspiration for this approach came from the observation that a volume constructed using the volume intersection algorithm should project back onto each of the silhouettes exactly. A silhouette is the region where an object projects onto the "real" camera's image plane. If the cameras are correctly calibrated, then the camera model's projection operation should have the same mapping of 3D points to 2D points as the "real" camera. Therefore the projection of the 3D model will be equivalent to the projection of the "real" object and thus create the same silhouette.

The four images on the left hand side of figure 2-5 show a reprojection of a 3D model back onto silhouettes for cameras which are not calibrated correctly. The reprojected pixels are in white and the silhouette pixels are in light gray. After running the algorithm for several thousand iterations, the projection of the 3D model back onto the image plane matches the silhouette. When this condition occurs, the algorithm halts and outputs calibrated camera parameters. The model generated with this calibration has been used in many figures throughout this thesis.

Figure 2-5: Left four images: Projection of a 3D model onto four silhouettes for rough calibration estimates. The 3D model projection is in white and the silhouette is in light gray. Right four images: Projection of the 3D model after the refinement algorithm is run. The 3D models project back almost exactly onto the silhouettes.

The algorithm is initialized with a set of rough camera estimates. A 3D model is generated using the silhouette intersection algorithm described in section 1.2. Each camera carves away the portion of the volume that it doesn't need. If a camera is miscalibrated, then it will carve away a portion of the model that is supposed to be there. The 3D model is projected back onto each of the image planes with the silhouettes. Because some of the cameras have carved away part of the model that actually should have been there, the projection of the model will not match the silhouettes. In fact, the projection will always be contained within the silhouettes. This fact allows us to define an evaluation function for the goodness of calibration. Let $p$ be the number of pixels within the silhouette, let $r_m$ be the number of pixels in the projection of the 3D model onto image $m$, and let $M$ be the number of images. Our evaluation criterion $C$ is simply a count of the number of pixels in the reprojections:

$$C = \sum_{m=1}^{M} r_m \qquad (2.2)$$

For greater values of C, the calibration is more accurate. The algorithm is summarized by the following steps.

* Inputs: rough camera estimates and silhouettes.
* (i) Construct a 3D model with the silhouettes and camera models.
* (ii) Project the 3D model back to each of the image planes.
* (iii) Evaluate C, our criterion function.
* Loop:
  - Change each of the camera parameters by a small amount $\Delta_i^m$.
  - Do steps (i), (ii) and (iii).
  - If $C_{new} > C_{old}$, update the camera parameters with $\Delta_i^m$.
* Stop when C no longer improves or after a given number of iterations.
* Save the calibrated cameras.

The first steps of the algorithm find an initial value for C. This is done by constructing a volume with the original camera models and then projecting back to the images and evaluating C.
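A compact sketch of this loop is given below. The helper that performs steps (i)-(iii) and returns C is an assumed interface standing in for the machinery of section 1.2, and the per-parameter update magnitudes follow the ranges given in the next paragraph. This illustrates the accept-if-C-improves structure, not the thesis implementation.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

struct CameraModel { double p[7]; };   // position, orientation, field of view
struct SilhouetteImage;                // opaque here; supplied by chapter 3

// Assumed helper that performs steps (i)-(iii): build the visual hull from the
// current calibration, reproject it, and return the pixel count C of (2.2).
typedef long (*EvaluateCFn)(const std::vector<CameraModel>&,
                            const std::vector<const SilhouetteImage*>&);

std::vector<CameraModel> refineCalibration(
        std::vector<CameraModel> cams,
        const std::vector<const SilhouetteImage*>& sils,
        EvaluateCFn evaluateC, int maxIterations)
{
    long bestC = evaluateC(cams, sils);
    // Per-parameter update magnitudes following the ranges given below:
    // +/-0.002 inches for position, +/-0.05 degrees for orientation and
    // field of view.
    const double delta[7] = {0.002, 0.002, 0.002, 0.05, 0.05, 0.05, 0.05};
    for (int it = 0; it < maxIterations; ++it) {
        std::vector<CameraModel> trial = cams;
        for (std::size_t m = 0; m < trial.size(); ++m)
            for (int i = 0; i < 7; ++i)
                trial[m].p[i] += delta[i] * (2.0 * std::rand() / RAND_MAX - 1.0);
        const long c = evaluateC(trial, sils);
        if (c > bestC) { bestC = c; cams = trial; }   // keep only improvements
    }
    return cams;
}
```

The annealing of the step size and the gradient-direction heuristic described next would wrap around this basic loop.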
Each of the camera model parameters is then updated by a random amount $\Delta_i^m$, where the subscript i enumerates the seven camera parameters and the superscript m enumerates the cameras. Because the camera parameters have different units, they are updated by different amounts. The three camera position parameters, relative to a global coordinate system, are in units of inches, with the updates $\Delta_{1-3}$ randomly chosen within the range (-0.002 to 0.002). The updates $\Delta_{4-6}$ for the three orientation parameters are taken randomly from the range (-0.05 to 0.05 degrees). Finally, $\Delta_7$, for the field of view parameter, is taken from the range (-0.05 to 0.05 degrees).

I make two adjustments to the parameters of the algorithm for better performance. First, if C does not improve after 50 iterations, I multiply the updates $\Delta_i$ by 0.8. By lowering the values of the parameter updates, we are able to home in on the correct calibration. Lower values for $\Delta_i^m$ help avoid the possibility that our steps are too big and are jumping over an improved calibration in parameter space. However, smaller steps can make the algorithm take longer to reach the "best" calibration, especially if the starting point is far enough away that bigger steps could have been taken. As with any gradient descent type algorithm, there is an art to choosing the update steps. The other enhancement I make is to estimate a gradient direction when there is an increase in our evaluation criterion C. When C increases, I use the $\Delta_i$'s from each camera as an initial direction vector for the next set of $\Delta_i$ updates.

To construct the models, I use the voxel projection approach with a relatively small grid spacing of 0.5-1.0 in. For the reprojection stage, I need to ensure that each pixel will be set if it falls within the projection of the voxelated model. Depending on the grid spacing and the orientation of the camera, simply projecting voxels back to the image will potentially miss interior pixels. Therefore, to find the reprojections, I cast rays out from each silhouette pixel and test whether they intersect the 3D model (a sketch of this test appears below).

Initialization is important because the algorithm relies on generating a 3D model. If the cameras were randomly initialized, it would be very unlikely that all the cones projected from the silhouettes would intersect. In that case, the algorithm would never be able to improve the match between the reprojections and the silhouettes.
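The ray test mentioned above can be sketched as a fixed-step march from the camera center through the voxel grid, stopping as soon as an occupied voxel is hit. The step length and the grid conventions are illustrative assumptions; the thesis does not specify them.

```cpp
#include <cstddef>
#include <vector>

// Occupancy grid produced by the silhouette intersection step; voxel (i, j, k)
// occupies the cube of side voxelSize anchored at the volume origin.
struct VoxelGrid {
    int nx, ny, nz;
    double voxelSize;
    std::vector<bool> occupied;   // size nx * ny * nz
    bool at(int i, int j, int k) const {
        if (i < 0 || j < 0 || k < 0 || i >= nx || j >= ny || k >= nz) return false;
        return occupied[(static_cast<std::size_t>(k) * ny + j) * nx + i];
    }
};

// Marches from the camera center along the (unit) ray direction in fixed steps
// and reports whether any occupied voxel is hit. A silhouette pixel whose ray
// hits the model counts toward the criterion C.
bool rayHitsModel(const VoxelGrid& grid,
                  double ox, double oy, double oz,     // camera center
                  double dx, double dy, double dz,     // unit ray direction
                  double maxDistance)
{
    const double step = 0.5 * grid.voxelSize;          // assumed step length
    for (double t = 0.0; t < maxDistance; t += step) {
        const double x = ox + t * dx, y = oy + t * dy, z = oz + t * dz;
        const int i = static_cast<int>(x / grid.voxelSize);
        const int j = static_cast<int>(y / grid.voxelSize);
        const int k = static_cast<int>(z / grid.voxelSize);
        if (grid.at(i, j, k)) return true;
    }
    return false;
}
```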
2.2.4 Experiments

We have tried this approach with both synthetic and real data. The synthetic data was generated using 16 known cameras imaging a virtual cylinder which was 16 inches long and 14 inches in diameter. To approximate the geometry of the Capture Cube space, we used a set of camera models which had been calibrated for a live run of the system. The parameters of each of the cameras were then disturbed randomly. The translation parameters of each camera were disturbed randomly within the range of (-6 to 6 inches). The three rotational parameters were disturbed within the range of (-0.1 to 0.1 radians). And the horizontal field of view parameter of each camera was randomly adjusted by a value between (-5 to 5 degrees). By adjusting the parameters within these ranges, we simulated a very rough initial camera calibration for the data. After 500 iterations, and one hour of computation time, the 3D reconstruction improves dramatically. Figure 2-6 demonstrates this improvement.

Figure 2-6: Left: the reconstructed "visual hull" for the original set of cameras. Middle: Reconstruction after the cameras have been randomly changed. Right: Reconstruction after 500 iterations of the "Volume Reprojection" refinement algorithm.

Figure 2-7 shows the results of refinement on a real data set after running the algorithm with 12 cameras for 60 iterations. It took 25 minutes to run the algorithm at one voxel per inch. The initial calibration estimate is not far off from the "true" calibration. This data set, which includes silhouettes and camera estimates, was used because it is typical of the data obtained from the Capture Cube system. The refinement algorithm also improved the calibration on other real data sets, which initially had rougher camera estimates. The parameters are adjusted using the values described in section 2.2.3. The region around the hand is emphasized to show the improvements in the object shape made by the "Volume Reprojection" algorithm.

Figure 2-7: Left: Original image. Middle: Reconstruction before refinement. Right: Reconstruction after refinement.

2.2.5 Discussion

Our approach has several advantages over conventional approaches. First, no correspondences are necessary. It works with arbitrary baselines. It is relatively fast. It can be used without a particular type of calibration target (an arbitrary target works). It can be left to operate during other processing (3D reconstruction), which allows the system to dynamically recalibrate itself. Some may find it a drawback that many cameras are needed to run this algorithm. However, it is generally much more difficult to calibrate many cameras than one or two cameras, and our calibration scheme addresses this difficulty. From our experimental section, we also noticed that with more cameras the algorithm converges to a truer calibration, because more cameras determine a more unique "visual hull."

There are two potential downsides to this approach. First, the algorithm is time consuming. It runs in approximately 5-10 minutes to calibrate six cameras whose initialization is relatively close to the true calibration. For a larger number of cameras and an initialization which is farther from the true calibration, the algorithm will take much longer, on the order of hours. Both model generation and reprojection are time consuming. To get around this problem, I start by decimating each of the images by a factor of 4. By doing this, I can lower the resolution of the voxel space by a factor of 4, and the number of pixels that fall within the silhouette is reduced by a factor of 4. The effect is an eight-fold increase in speed. However, you do lose accuracy as you lower the resolution of the space. When the calibration gets close, the pixels at the edges of the silhouette are the most important, because they are where our evaluation criterion C can still improve. At lower resolutions, the boundary pixels are left out or merged with their neighbors. Therefore, as the algorithm gets closer to the "true" calibration, I increase the resolution. The other potential drawback is that the calibration algorithm relies on our being able to obtain good silhouettes from images. This has proven to be difficult with cheap cameras and uncontrolled lighting and background.

We believe this calibration method will have an important impact on multi-camera systems. Even though the inspiration for this method was derived from silhouette intersection, the approach can be used to calibrate any other multi-camera system.
This is significant because it will allow further research in this area and more successful multi-camera systems.

Chapter 3
Silhouette Construction

3.1 Introduction

A crucial component of an integrated real-time 3D geometry system is the construction of accurate silhouettes. Accurate silhouettes are essential for volume construction and calibration (see section 2.2). Speed is essential for a real time system. This chapter explores a fast method.

Segmenting foreground from background is an important task for many computer vision applications, from object recognition to tracking. The naive silhouette intersection algorithm will generate holes in the 3D model when there are holes in the silhouette. An ideal algorithm would identify smooth bounded regions without holes as the foreground object, and yet still preserve holes when they are actually present. This type of algorithm would be ideal because the objects that we wish to reconstruct, e.g. people, do not have jagged boundaries, are not semi-translucent and are not full of holes.

3.1.1 Preliminaries

Before constructing silhouettes there are several issues we need to address. Typically the first step in silhouette construction is background subtraction. This requires that, from each view, an image be taken before the object is present to get a baseline for the background. Then another image is taken when the object is in the scene. When the difference between the two images is evaluated, the foreground object should be prominent and associated with high difference values. If the lighting, or any other factor that affects the camera, changes in the time between taking the foreground and background images, it will be very difficult to identify the foreground object.

Figure 3-1: The figure on the left is the difference image without the correction to the gain. The figure on the right shows the result after automatic gain control correction.

Automatic Gain Correction

Inexpensive cameras, such as the Intel Create and Share models we used, have automatic gain control which cannot be turned off. If a foreground object moves into the scene and makes the entire scene brighter or darker, the camera will adjust the pixels by an unknown gain function. To compensate for this, I solve for two variables $\alpha$ and $\beta$ that can account for this variability: I want to find the $\alpha$ and $\beta$ that minimize $\sum_i (F_i - \alpha B_i - \beta)^2$, where $F_i$ and $B_i$ are corresponding pixels of the foreground and background images and the sum runs over all N pixels. Using a least squares approach, I am left with two equations and two unknowns, which can be summarized as

$$\begin{bmatrix} \sum_i B_i^2 & \sum_i B_i \\ \sum_i B_i & N \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} \sum_i F_i B_i \\ \sum_i F_i \end{bmatrix} \qquad (3.1)$$

Interlacing

The Intel cameras also output a standard NTSC signal with interlacing. This can be problematic for silhouette extraction because areas of motion will appear to be striped, making it difficult to determine an accurate boundary. Each frame that is picked up by the frame grabber contains data from two frames captured 1/60 sec apart; the data in each row alternates between the two frames. Merging these frames gives a 640x480 frame approximately every 1/30 sec. By decimating the image to 320x240, we avoid the interlacing problem.
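A minimal sketch of the gain correction solve of equation (3.1) is shown below: it accumulates the sums over the background and foreground images and solves the 2x2 system for α and β by Cramer's rule. The grayscale image representation is an assumption made for the illustration.

```cpp
#include <vector>

// Solves equation (3.1) for the gain alpha and offset beta that best map the
// background image onto the current frame in the least squares sense.
// Images are assumed to be equal-sized arrays of grayscale values.
bool solveGain(const std::vector<double>& F, const std::vector<double>& B,
               double& alpha, double& beta)
{
    const double N = static_cast<double>(F.size());
    double sumB = 0, sumB2 = 0, sumF = 0, sumFB = 0;
    for (std::size_t i = 0; i < F.size(); ++i) {
        sumB  += B[i];
        sumB2 += B[i] * B[i];
        sumF  += F[i];
        sumFB += F[i] * B[i];
    }
    // Normal equations: [sumB2 sumB; sumB N] [alpha; beta] = [sumFB; sumF].
    const double det = sumB2 * N - sumB * sumB;
    if (det == 0.0) return false;        // degenerate case (constant background)
    alpha = (sumFB * N - sumF * sumB) / det;
    beta  = (sumB2 * sumF - sumB * sumFB) / det;
    return true;
}
// The corrected difference image is then |F_i - (alpha * B_i + beta)| per pixel.
```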
3.2 Silhouettes through Graph Cuts

3.2.1 Introduction

In this section I present a method of constructing a silhouette from difference images. I also account for shadows, which are not strictly foreground or background. The method is based on a talk given at MIT by Ramin Zabih and the associated paper (Boykov, Veksler and Zabih, 1999). Their approach allows one to solve (in some cases) or approximate the minimization of an energy function by constructing a graph, where the edge weights are terms from the energy function and the minimization corresponds to the min-cut of this graph. For a binary labeling the algorithm runs in close to 3 seconds on a 320x240 image. [1]

[1] This work was first developed for the final project of the Robot Vision class and extended for integration with the Capture Cube system.

3.2.2 Energy Minimization and Graph Cuts

The goal of the algorithm proposed in (Boykov, Veksler and Zabih, 1999) is to pose an energy minimization as a graph problem, and then solve it using fast graph methods. In this formulation, the energy function consists of a data term and a smoothing term:

$$E = E_{data} + E_{smooth} \qquad (3.2)$$

The graph cut method for energy minimization can be applied to any vision problem that can be defined using the above energy function. In this section I formulate the problem for silhouette extraction and show how the energy function can be minimized using the graph cut method. I include some notational details in order to give an overview of the result in (Greig, Porteous and Seheult, 1989), which links energy minimization to the min-cut of a graph. In section 4.2, I apply the graph cut method to an energy function in voxel space. The same arguments which link energy minimization to a minimum cut on a graph still hold for that graph, which has a three dimensional structure.

A detailed description requires defining a number of variables more formally. I will use notation similar to that of (Boykov, Veksler and Zabih, 1999), although in a simplified context. The graph $G(V, E)$ associated with the energy function defined above has $P + 2$ vertices, whose edge weights correspond to $E_{data}$ and $E_{smooth}$. The vertices consist of the P pixel locations plus a source and a sink vertex, which correspond to opposing labels. For the case of a binary silhouette, the labels given to pixels correspond to the two states, object or background. If we have more than two states for a pixel, then P consists only of the pixels that are currently under consideration. $f$ represents the current pixel labeling, and $\hat{f}$ is the new labeling after the minimization is found for the set of pixels currently labeled $\alpha$ or $\beta$. This minimization over the set of pixels labeled $\alpha$ or $\beta$ is called an $\alpha$-$\beta$ swap.

Let $C$ be a cut on $G(V, E)$, the set of edges which are cut. The cost of a cut C is equal to the cost of cutting edges to the source and sink plus the cost of cutting edges between pixels. Define $p$ to be a pixel location, $t_p^{\alpha}$ to be the edge between the pixel and the source, and $t_p^{\beta}$ the edge to the sink. Also define the edge between pixels p and q to be $e_{\{p,q\}}$. The weights assigned to the edges are $D(\alpha)$ on $t_p^{\alpha}$, $D(\beta)$ on $t_p^{\beta}$ and $V(\alpha, \beta)$ on the edges $e_{\{p,q\}}$. $D(\alpha)$ and $D(\beta)$ are the data terms and $V(\alpha, \beta)$ is the smoothing term of the energy function. I will discuss my choices for these functions in section 3.2.3.

The paper (Boykov, Veksler and Zabih, 1999) shows that the energy function is equivalent to the cost of the min-cut of the graph plus a constant. Writing $P_{\alpha\beta}$ for the set of pixels currently labeled $\alpha$ or $\beta$ and $N$ for the set of neighboring pixel pairs, the cost of a cut C is defined as

$$|C| = \sum_{p \in P_{\alpha\beta}} \left| C \cap \{t_p^{\alpha}, t_p^{\beta}\} \right| + \sum_{\substack{\{p,q\} \in N \\ \{p,q\} \subset P_{\alpha\beta}}} \left| C \cap e_{\{p,q\}} \right| \qquad (3.3)$$

and it is shown that this is equivalent to

$$|C| = \sum_{p \in P_{\alpha\beta}} D_p(\hat{f}_p) + \sum_{\substack{\{p,q\} \in N \\ \{p,q\} \subset P_{\alpha\beta}}} V_{\{p,q\}}(\hat{f}_p, \hat{f}_q) \qquad (3.4)$$

Now, to bridge the final gap,

$$E(\hat{f}) = |C| + k \qquad (3.5)$$

where k is equal to the energy contributed by the pixels which are not labeled $\alpha$ or $\beta$.

The new labeling $\hat{f}$ falls out from the graph cut.
Start at the source and follow every path whose edges are not saturated by the max flow computation. Every vertex corresponding to a pixel that is reached is labeled with the corresponding sink label; otherwise the vertex is labeled with the source label. See figure 3-2 for a one dimensional example.

Figure 3-2: A one dimensional max flow/minimum cut example. (i) An example of a 1-dimensional graph with edge weights and flow direction noted. (ii) The flow through each edge. (iii) The vertex/pixel labeling which corresponds to the min-cut. The figure also shows the edges which are cut. The cut separates the graph source and sink, so that there is no path between the two.

For a problem with only two labels, the energy function can be minimized with one graph cut. This is accomplished by simply setting up the graph with the associated weights and then minimizing by finding the min-cut. For more than two labels, the algorithm consists of a loop over pairs of labels (L is the set of all labels). The loop exits when the total energy does not go down, which occurs when no swap move between a pair of labels improves the energy. In (Boykov, Veksler and Zabih, 1999), they claim the loop terminates in a finite number of cycles, and experimentally this has not been proven incorrect.

1. Start with $f$ as all pixels labeled with some $l \in L$.
2. Begin loop: for each pair of labels $\{\alpha, \beta\} \subset L$
   (a) Find $\hat{f}$.
   (b) If $E(\hat{f}) < E(f)$, set $f = \hat{f}$; otherwise break.
3. End loop.

3.2.3 Implementation

Introduction

I have implemented the algorithm described above for two cases. First, I implemented the algorithm for the two label case, where the labels correspond to object and background. Implementing this case required setting up the graph and minimizing with a min-cut. I also implemented the swap move algorithm for the case with three labels, which correspond to object, background and shadow. To set up the algorithm, I need to define an energy function to minimize. I use the difference image as a measure of foreground and a function based on intensity for shadows. I show results for both the two and three label cases.

Shadows and color space

The inputs to the algorithm are two images: one of the background and one, with the same camera parameters, with a foreground object present in the scene. The RGB coded images are 320x240 in size. RGB space is not a good domain in which to compare color values. For example, for shadow removal we want to compare pixels with the same color but with different intensity values; a shadow pixel will have a lower intensity. With this in mind, a natural choice is to convert RGB to HSV [2], as the V component measures luminance.

[2] HSV stands for hue, saturation and luminance. The luminance component can be thought of as the brightness of a particular color. For example, if the luminance component has a value of 0, then the corresponding pixel is black. A value of 1 for the luminance component will result in the full intensity of the color that is described by the hue and saturation elements.

Choosing the energy function

As in any energy minimization problem, an appropriate description of the energy function is critical for the success of the algorithm. A good energy function should both represent the problem and be amenable to the particular algorithm. For the binary case of choosing between object and background, an obvious measure presents itself:

$$E_{background} = D(\alpha) = k_1\,err(H) + k_2\,err(S) + k_3\,err(V) \qquad (3.6)$$
$$E_{foreground} = D(\beta) = C_1 \qquad (3.7)$$
$$E_{edge} = V(\alpha, \beta) = C_2 \qquad (3.8)$$

I choose err(H), err(S) and err(V) to be truncated linear functions of the differences $(H_f - H_b)$, $(S_f - S_b)$ and $(V_f - V_b)$. As the difference value becomes larger, the pixel location is almost definitely part of the object. Large difference values mean the pixel is a very different color from the background, but not necessarily more object-like than a pixel with a smaller difference value. From this observation, truncation makes sense. Our choice of truncation level is determined by experimental observation.

Figure 3-3: (i) The function describing err(H), err(S) and err(V). (ii) The function describing err_shad.
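Since figure 3-3 is not reproduced here, the sketch below shows one plausible reading of these terms: a truncated linear penalty for the H, S and V differences, and, for the shadow term introduced next, a ramped step that stays small when a pixel keeps roughly the background's color but loses luminance. The truncation level and ramp width are assumptions, not the thesis's values.

```cpp
#include <cmath>

// Truncated linear penalty used for err(H), err(S) and err(V): grows with the
// absolute difference and then saturates. The cap is an assumed value, and the
// circular wrap-around of hue is ignored for brevity.
double errTruncated(double fg, double bg, double cap = 30.0)
{
    const double d = std::fabs(fg - bg);
    return d < cap ? d : cap;
}

// One plausible reading of err_shad: a step with a linear ramp, small when the
// foreground pixel is darker than the background (shadow consistent) and equal
// to 1 when it is as bright or brighter. The ramp width is an assumption.
// This term feeds the shadow data term defined next.
double errShadow(double vFg, double vBg, double ramp = 0.1)
{
    const double d = vFg - vBg;            // negative in shadow
    if (d <= -ramp) return 0.0;
    if (d >= 0.0) return 1.0;
    return (d + ramp) / ramp;              // linear ramp between 0 and 1
}

// Background data term of equation (3.6) for one pixel, given HSV values.
double backgroundCost(const double fg[3], const double bg[3],
                      double k1, double k2, double k3)
{
    return k1 * errTruncated(fg[0], bg[0])
         + k2 * errTruncated(fg[1], bg[1])
         + k3 * errTruncated(fg[2], bg[2]);
}
```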
Figure 3-3: (i) The function describing err(H), err(S) and err(V). (ii) The function describing err_shad.

For the foreground edge weight, I can simply choose a constant. This sets up an opposition between E_{background} and E_{foreground}: as I lower C_1, more pixels are labeled object. C_2 is the smoothing factor; a larger C_2 means more smoothing. For three labels, I need to add a term for shadows:

E_{shadow} = k_4 \cdot (k_1 \cdot err(H) + k_2 \cdot err(S) + k_3 \cdot err_{shad})   (3.9)

err(H) and err(S) are the same as before, and I choose err_shad to be a step function with a linear ramp between 0 and 1 (see figure 3-3). This function implies that if the colors are the same and the object pixel is darker than the background, the pixel location is likely to be in shadow. The constant k_4 controls how the shadow term interacts with the object and background terms.

Max flow/min-cut

The graph is constructed with the weights described above. Next, I find the min-cut in a two step process (see CLR (Cormen, Leiserson and Rivest, 1991) for a discussion of the max-flow min-cut theorem). First I find the max flow of the graph; I integrated max-flow code written by Cherkassky and Goldberg (Cherkassky and Goldberg, 1997) into my system. The max-flow algorithm assigns flow values to all the edges, and the min-cut is then found by identifying all the edges whose flow is at capacity. To find the new labeling \hat{f}, the edges to the source and sink which are cut are given the labels that correspond to the source and the sink respectively.

3.2.4 Results

I present results for the two label (object and background) case; however, I focus on the case that distinguishes between object, background and shadow. Here, I show results for two different images and for different parameter values. I also compare the graph cut method with a "morphological" method. This method starts with a difference image, thresholds the image, applies a median filter (to clean up noise), and ends with a dilation operation (to fill in holes). I threshold the difference image at a value of 20. A 5x5 kernel computes the median filter: a pixel is declared "on", or part of the foreground, if more than half the pixels in the kernel window are on. After this operation, if a pixel is "on", I set all the pixels within a neighborhood of two of that pixel to "on"; the effect of this last operation is dilation.

3.2.5 Discussion

In general the graph cut method does a very good job of finding a consistent bounded region. From figure 3-11, we see that there are still some errors relative to the optimal (hand segmented) solution. In particular, the results from the graph cut method appear to be boxy: the bounding region does not have curved surfaces, but makes steps in the horizontal and vertical directions. This observation could stem from the fact that I only connected the neighbors in the horizontal and vertical directions when constructing the graph.
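One plausible remedy, not implemented in the thesis, is to add diagonal n-links so that the smoothness term no longer prefers axis-aligned boundaries. The sketch below only enumerates the neighbor pairs such a graph would connect; the offsets and names are illustrative.

```python
# Forward neighbor offsets, so each pair is generated exactly once.
FOUR_CONNECTED = [(0, 1), (1, 0)]                     # what the thesis uses
EIGHT_CONNECTED = FOUR_CONNECTED + [(1, 1), (1, -1)]  # adds diagonal n-links

def neighbor_pairs(height, width, offsets):
    """Yield each neighboring pixel pair once for a height x width grid."""
    for y in range(height):
        for x in range(width):
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    yield (y, x), (ny, nx)
```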
The graph cut method is also well suited to problems with more than two labels, as demonstrated in figure 3-6, and it empirically outperforms the "morphological" method. The graph cut method runs in 2.7 seconds on 320x240 images, where most of the time is spent on organizational details such as graph management. For images which are 160x120, it takes 0.75 seconds to produce a silhouette segmentation, of which the max-flow computation takes 0.2 seconds. With another decimation of the image, down to 80x60, the whole computation takes 0.1 seconds; with this last image size, I would be able to run the system at ten frames a second. The models resulting from such small silhouettes would be accurate but lack detail; however, given the inherent inaccuracies of the visual hull representation of an object, the lack of detail could be justified.

Figure 3-10 compares models which were generated with and without shadow removal. For this set of data, there were only three views of the leg region. (Three views is at the limit of the number of views necessary to generate "reasonable" geometry.) In addition, there was heavy shadowing due to the position of the figure with respect to the lighting. Because of the limited number of views, the accuracy of each view's silhouette is even more important. In the model generated without shadow removal, the torso appears to be mounted on a wide pillar; the model that used the silhouettes where shadows were accounted for clearly identifies the presence of leg geometry. When using image differencing as part of the process of silhouette extraction, shadows will always be a problem where an object is close to a background surface. This is clearly the case at the location where the figure's feet meet the floor.

Figure 3-4: Upper-left: original foreground image. Lower-left: image describing error from the background, k1 = 5, k2 = 5, k3 = 20. Upper-right: "morphological" result. Lower-right: graph cut result, C1 = foreground weight = 8, C2 = edge weight = 12. The graph cut method does a good job of filling in the holes and preserving most of the foreground boundaries.

Figure 3-5: Background images for the following figures.

Figure 3-6: Upper-left: original foreground. Lower-left: image describing error from the background, k1 = 5, k2 = 5, k3 = 20. Upper-right: image describing error for non-shadow regions, k1 = 5, k2 = 5, k3 = 20, k4 = shadow term weight = 1.15. Lower-right: graph cut result, C1 = foreground weight = 10, C2 = edge weight = 24. I think this result is particularly exciting because it finds bounded regions for both foreground and shadows.

Figure 3-7: This set of figures has the same original as the set in figure 3-6. Each of the four images has a different edge weight and foreground weight. Upper-left: foreground weight = 10, edge weight = 0. Lower-left: foreground weight = 18, edge weight = 10. Upper-right: foreground weight = 10, edge weight = 50. Lower-right: foreground weight = 18, edge weight = 50.

Figure 3-8: Graph cut method on another image. Upper-left: original foreground. Lower-left: image describing error from the background, k1 = 5, k2 = 5, k3 = 20. Upper-right: image describing error for non-shadow regions, k1 = 5, k2 = 5, k3 = 20, k4 = shadow term weight = 1.15. Lower-right: graph cut result, C1 = foreground weight = 10, C2 = edge weight = 24.

Figure 3-9: Top: six out of sixteen views which were used to generate the following 3D models. Middle: silhouette extraction using the graph cut method. Bottom: silhouette extraction using the graph cut method with shadow removal.
Figure 3-10: Top: two views of the model generated from the silhouettes where shadows were taken into account. Bottom: the same views as above, but without shadows removed from the silhouettes.

3.3 Technique comparison

The following figure compares the results of the graph-cut method with the "morphological" method. Each method's result is overlayed on top of a hand segmented result. Visually we can see the advantages and disadvantages of both methods. For instance, the "morphological" result still leaves holes in the silhouette; on the other hand, it does not start filling in the gap between the legs as the graph-cut method does. The formulation of the graph-cut method as an energy minimization gives a theoretical foundation to build upon. For instance, edge support could be incorporated into the graph-cut method, which would bring the boundary closer to the hand segmented result.

Figure 3-11: Comparing the graph cut method with the "morphological" method. Upper-left: hand segmented foreground. Lower-left: hand segmentation overlayed on the graph cut result. Upper-right: "morphological" result. Lower-right: hand segmentation overlayed on the "morphological" result.

Chapter 4

Reconstruction without Silhouettes

4.1 Introduction

Voxel occupancy is one approach for reconstructing the 3-dimensional shape of an object from multiple views. In voxel occupancy, the task is to produce a binary labeling of a set of voxels that determines which voxels are filled and which are empty. In this chapter, we give an energy minimization formulation of the voxel occupancy problem. The global minimum of this energy can be rapidly computed with a single graph cut, using a result due to (Greig, Porteous and Seheult, 1989). The energy function we minimize contains a data term and a smoothness term. The data term is a sum over the individual voxels, where the penalty for a voxel is based on the observed intensities of the pixels that intersect it. The smoothness term is the number of empty voxels adjacent to filled ones. Our formulation can be viewed as a generalization of silhouette intersection, with two advantages: we do not compute silhouettes, which are a major source of errors, and we can naturally incorporate spatial smoothness. We give experimental results showing reconstructions from both real and synthetic imagery. Reconstruction using this smoothed energy function is not much more time consuming than simple silhouette intersection; it takes about 10 seconds to reconstruct a one million voxel volume. (This paragraph is the abstract of (Snow, Viola and Zabih, 2000), appearing in CVPR 2000.)

Up to now our approach to model generation has been based on the silhouette intersection algorithm. There are several potential drawbacks to silhouette intersection. First, it requires accurate segmentations. A group of silhouettes explicitly specifies the model, and as discussed in section 1.2 their intersection is known as the "visual hull". An error in the silhouette computation from one image can significantly affect the resulting visual hull, particularly if the boundary of the silhouette is underestimated or there are pixel errors within the silhouette itself. Pixels within the interior of a silhouette which are incorrectly labeled background will result in holes in the model. Silhouette computation also often requires relatively expensive operations.
In practice some kind of local morphological operator (Serra, 1982) is usually applied as a cleanup phase, either to the 2D silhouette images or to the 3D volume. In chapter 3, we explored a method for silhouette computation which could run in close to real time; however, by avoiding silhouette computation altogether, we can potentially save some valuable time and reduce errors. In this chapter, we discuss a technique for creating 3D models without explicitly constructing silhouettes.

In section 3.2, we presented a method of silhouette construction by posing an energy minimization as a graph problem and solving it by finding the minimum cut of the graph. In this section, we will follow a similar procedure to reconstruct a 3D model given a set of images. In our formulation, individual pixels are not set to binary values and therefore do not determine the fate of all the voxels intersected by Ray_k (the ray emanating from the origin of camera k and passing through a given pixel of its image). Instead, each pixel contributes evidence to all the voxels intersected by Ray_k. In addition, prior knowledge tells us that objects are not full of holes, so we add a smoothness term to the energy function that enforces consistency between neighboring voxels. Ray construction is summarized in figure 4-1. Our method can be viewed as a generalization of silhouette intersection; we will show at the end of section 4.2.3 that with the appropriate data term and smoothness term, minimizing the energy is equivalent to intersecting silhouettes.

Figure 4-1: Top left: points on an image. Rays constructed using the camera center and the points on the image. Bottom: a set of rays projected into space at the boundary of the silhouette.

4.2 Problem Formulation

4.2.1 Terminology

The algorithm produces a labeling f for a set of voxels V. The set of voxels V has a binary labeling: a voxel is labeled 1 if it is within an object, or 0 if it is outside. The label for a particular voxel is designated f_v, where v is in V. The neighbors of v are N(v); we define the neighborhood as containing all the adjacent voxels. Each camera k has an image I_k and an associated background image \bar{I}_k, which is taken when no object is present. For a pixel p, we have a difference measure from the background, \delta_k(p) = I_k(p) - \bar{I}_k(p). The sum of all the difference measures that apply to a particular voxel is designated \Theta_v. We designate the set of pixels whose rays Ray_k, k = 1, ..., m, intersect a particular voxel as P_v.

4.2.2 The Energy Function

Our energy function has two terms, E_data and E_smooth. First, the data term D_v(f_v) describes how well a label corresponds to the observed data. Because we are minimizing an energy function, D_v(f_v) charges a penalty for a labeling that does not coincide with the data; the exact form of the data function will be described in the experiments section. For example, if \delta_k(p') is large for the pixels p' in P_v, we expect f_v to equal 1. In other words, if all the pixels contributing to a voxel have large differences between foreground and background, then the voxel should be labeled as part of the object. Second, we have a smoothing term, which adds a penalty for neighboring voxels which do not have the same label: for two neighboring voxels we have (1 - \delta(f_v - f_{v'})), where \delta here is the unit impulse function, which is 1 at the origin and 0 elsewhere. Combining the two terms, we can write down the energy function.
We wish to obtain the labeling f* that minimizes

E(f) = \sum_{v \in V} D_v(f_v) + \lambda \sum_{v \in V} \sum_{v' \in N(v)} \bigl(1 - \delta(f_v - f_{v'})\bigr).   (4.1)

The constant \lambda (often called the regularization parameter) controls the degree of spatial smoothness; greater values of \lambda enforce a higher degree of smoothness.

4.2.3 Energy minimization vs. silhouette intersection

It can easily be shown that the energy minimization formulation reduces to basic silhouette intersection. First, silhouette intersection does not include a term for smoothness, so we set \lambda = 0. Second, if we enforce a binary labeling for \delta_k(p'), we can define the function D_v(f_v) = f_v if any of the pixels p' in P_v are outside the silhouette (and thus equal to zero); otherwise, D_v(f_v) = 1 - f_v. The global labeling f*, with this choice of \lambda and D_v, produces the same result as silhouette intersection.

4.2.4 Graph cuts

The graph setup is very similar to the 2D case described in section 3.2.2. The edge weights of the graph G(V, E) are set using the energy function defined above. There are |V| + 2 nodes, consisting of the |V| voxel locations plus two terminal nodes, object and background, which are typically labeled source S and sink T. The graph is constructed such that each node has edges connecting it to its six neighbors, which lie along the directions of the 3D grid, and edges connecting it to the source and sink. The edge weight between neighboring voxel nodes is \lambda; this value is independent of the observed data. The edge weight between a voxel node and the object node is D_v(0), and the edge weight to the background node is D_v(1). The minimum cut algorithm removes enough edges to create two groups of voxel nodes: one group is associated with the object node and the other with the background node. The cost of the cut C on the graph G(V, E) is equal to the cost of the cut edges to the source and sink plus the cost of the cut edges between voxels, and the minimum cut algorithm finds the lowest cost cut which separates the voxels into two groups. In section 3.2.2 we showed how the minimum cut corresponds to minimizing the energy; an equivalent argument applies to the 3D graph. The following theorem from (Greig, Porteous and Seheult, 1989) summarizes this important result.

Theorem 1. If C = (S, T) is the minimum cut on G, then the corresponding labeling f^C is the global minimum of the energy E.

The labeling f^C which corresponds to the global minimum is defined as f^C(v) = 1 if v is in S and f^C(v) = 0 if v is in T. The paper (Snow, Viola and Zabih, 2000) provides a more formal presentation of the problem.
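To make the objective concrete, the sketch below evaluates equation (4.1) for a candidate voxel labeling on a 3D grid with 6-connectivity. In the system the global minimum is found with a single graph cut rather than by scoring candidate labelings; the function and variable names here are illustrative.

```python
import numpy as np

def voxel_energy(labels, d_empty, d_filled, lam):
    """E(f) = sum_v D_v(f_v) + lambda * (number of 6-connected neighbor pairs
    with different labels). labels: 3-D array of 0/1; d_empty and d_filled
    hold the per-voxel data costs D_v(0) and D_v(1)."""
    labels = np.asarray(labels)
    data = np.where(labels == 1, d_filled, d_empty).sum()
    disagreements = 0
    for axis in range(3):                  # neighbor pairs along x, y and z
        lo = [slice(None)] * 3
        hi = [slice(None)] * 3
        lo[axis] = slice(None, -1)         # voxel v
        hi[axis] = slice(1, None)          # its neighbor along this axis
        disagreements += np.count_nonzero(labels[tuple(lo)] != labels[tuple(hi)])
    return data + lam * disagreements
```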
4.3 Experiments

To demonstrate the success of this method, I show results from two sets of experiments: one on artificially generated data from a cylinder shape, and one on data taken from the Capture Cube.

4.3.1 Graph Specification

Before running experiments we need to choose a data function for the edge weights to the object and background nodes, and an edge weight between adjacent voxel nodes. To understand the choice, first assume that our graph consists of only one voxel node connected to the object and background nodes. If the edge weight to the object node is greater than that to the background node, we cut the edge to the background node and the voxel is labeled as object. There is a clear interplay between the edges to the object and background nodes: we only need to choose one function, describing how much evidence there is for the voxel being part of the object, and use it to set the edge weight to the object node. Because of this interplay, the edge weight to the background node only needs to be set to a constant. The background edges D_v(1) are set to a constant value of 300. With this choice, values of D_v(0) larger than 300 provide evidence for the voxel being part of the object. Intuitively, D_v(0) is the cost of labeling a voxel as empty, which is large in the case where there is a lot of support for the voxel being part of the object. D_v(0) is defined differently for the artificial and real experiments; in general, the choice of D_v(0) depends on a function of the observed data.

The voxel connection weight, \lambda, is of equal importance to the quality of the reconstructed volume: it determines the smoothness of the resulting reconstruction. Again, by assuming the problem consists of one voxel, we can better understand the choice of edge weights between voxels. If all of its neighbors are unoccupied, then a single voxel v may be occupied only if 6\lambda is less than the difference between D_v(0) and D_v(1). Narrow filaments are likely to be removed as \lambda is increased. Thus, intuitively, we can see how this term fills in holes and smoothes edges. In all of the experiments \lambda is 30.

4.3.2 Performance

The experiments were run on a 500 MHz Intel PIII with 384 MB of memory. The computation time depends on the size and discretization of the space. The real experiments were performed in the Capture Cube space, and the artificial experiments simulated the same dimensions and camera arrangement as the Capture Cube. Voxels are cubes with each dimension having a length of 2.5 cm; there are a total of 80 x 60 x 60 = 288,000 voxels. The computation time can be divided into three parts: the graph is prepared from the images (7 secs), the max-flow of the graph is computed (1.1 secs), and the minimum cut labeling of the voxels is computed from the max-flow (1 sec). The graph preparation involves a loop over every voxel in which the voxel is projected into each of the 16 images. This is the same sort of operation required to compute a simple silhouette intersection; the additional computation required to find the minimum energy voxel occupancy is minimal.

4.3.3 Synthetic experiment

We use synthetic data generated from a simple cylinder to compare our approach to silhouette intersection. Figure 4-2 (b) visually demonstrates a major shortcoming of the silhouette intersection approach: the characteristics of the shape that we recognize as a cylinder, large flat ends and straight parallel sides, are not preserved by silhouette intersection. To generate the data, sixteen camera views were placed in a typical arrangement in the Capture Cube. A cylinder, roughly the size of a human torso, was placed in the middle of the space and imaged using the virtual cameras. In the limit, as the number of camera views approaches infinity, the reconstructed volume will approach ground truth; in "real" experimental situations, however, there will never be enough cameras for this to happen. Most objects, including cylinders, are spatially smooth. Because our approach incorporates spatial smoothing, the reconstructed volume is closer to the ground truth: the smoothing operation cleans up the protrusion on the left side of figure 4-2(b) and smoothes out the top face.

Figure 4-2: Ground truth voxel reconstruction of a synthetic cylinder (left), silhouette intersection reconstruction (right), and our reconstruction (bottom).
Using perfect silhouettes, reconstruction was performed using the following edge weights: \lambda = 30, background = 300, and object = 25N, where N is the number of cameras which believe the voxel is inside their silhouette. Since the total number of cameras is 16, this weight has a maximum value of 400. Our reconstruction is shown at the bottom of Figure 4-2.

4.3.4 Real Experiments

Several real datasets were acquired using the Capture Cube system. Reconstruction proceeds much as above, using \lambda = 30 and D_v(1) = 300, except that the edge weight between the object node and a voxel v is a function of \Theta(v), the observed differences in intensity at the pixels that intersect v:

D_v(0) = \frac{1}{16} \sum_{\Delta \in \Theta(v)} \min(\Delta^2, 400)

This cost function uses a truncated quadratic to determine the significance of the pixel differences. If the differences are small in many images, D_v(0) will be small. If \Delta(p) is very large in only one or a few images, D_v(0) will still be relatively small. Only in the case where \Delta(p) is large in most of the images will the weight be large. This implies that if a voxel is to be labeled as part of the object, there must be support for this voxel from almost all the images. On the other hand, if we know the voxel should be included as a member of the object voxel set and one image does not register the corresponding pixel with a large difference measure (e.g. if the clothing is the same color as the background), the algorithm can still label the voxel correctly. The algorithm does not make an absolute choice, and implicitly accounts for errors in the difference measurements. With the constant comparison value of 400, \Delta(p) is effectively truncated at a value of 20; we have chosen this truncation value through experimentation. This process could be made more formal using probabilities.
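The following sketch computes this data term for a single voxel from its per-camera background differences. The truncation at 400 and the division by the 16 cameras follow the description above, but the exact normalization in the thesis implementation may differ, and the voxel-to-pixel projection is not shown.

```python
import numpy as np

def object_evidence(per_camera_diffs, num_cameras=16, trunc=400.0):
    """D_v(0): average truncated squared background difference over the
    pixels (one per camera) that voxel v projects to."""
    diffs = np.asarray(per_camera_diffs, dtype=float)
    return float(np.minimum(diffs ** 2, trunc).sum() / num_cameras)

# A voxel seen as "different from background" by most cameras gets a large
# weight; a single large difference on its own does not.
print(object_evidence([25, 30, 28, 2] + [22] * 12))   # strong support
print(object_evidence([90, 3, 2, 1] + [2] * 12))      # one outlier only
```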
The first set of figures provides a comparison between our method and silhouette intersection. Figure 4-4 shows several 3D reconstructions from the images in Figure 4-3. The top reconstruction is performed using our method; notice that the shape is quite smooth. The middle reconstruction is performed using conventional silhouette intersection: silhouettes were found by thresholding the difference images, followed by an erode and a dilate operation. The low quality of this reconstruction is due to the inaccuracies of the silhouettes, which is in turn related to the quality of the original images. Looking back at these images, notice that they are very "realistic": there is no artificial lighting, the backgrounds are quite complex, and the subject is wearing colors which appear in the background. The reconstruction at the bottom of the figure uses a heuristic mechanism to improve silhouette intersection. Classic silhouette intersection requires that each occupied voxel project to a pixel within the silhouette of every image; the heuristic reconstruction labels a voxel occupied if it projects into the silhouette in 75% of the cameras. This heuristic does a good job of filling holes, but it often yields reconstructions which are much larger than they should be.

There are several observations to make about the following figures. First, the lighting and background are complex and uncontrolled, which makes silhouette extraction difficult. In addition, the subjects are wearing natural clothing which often matches the color of the background. We have made no attempt to artificially simplify the environment to emphasize the strengths of this algorithm. The algorithm is able to recover from poor difference images by accumulating evidence from each contributing image and enforcing spatial smoothness. A second reconstruction is shown in Figure 4-5; in this case the reconstructed volume was limited to the area of the torso.

4.4 Conclusion

In this chapter, we have presented a new method for solving the voxel occupancy problem which does not rely explicitly on silhouettes. In addition to its fast execution, this method has several advantages over traditional volume intersection methods. Instead of relying on accurate silhouettes, the algorithm combines evidence from each of the difference images for each voxel. This is advantageous because one bad pixel location in the difference map will not affect the calculation of the model. In addition, our formulation incorporates a spatial smoothness term which further cleans up noise in the reconstructed model. Our preliminary results are exciting, and as we continue to investigate this new method a number of improvements immediately suggest themselves. For example, to enhance the smoothness over a curved surface we could use 18 or 26 neighbor connectivity. To increase speed, we might be able to use prior knowledge to run the algorithm only at the boundary of the object. This prior information can easily be extracted from our real-time system after initialization: we can use the location of the previous model to estimate the 3D region in which to run the algorithm.

Figure 4-3: Left: eight of the 16 images captured within the Capture Cube. Right: silhouettes computed from these images.

Figure 4-4: Three reconstructions from the images shown in Figure 4-3. Top left: reconstruction using our method. Top right: reconstruction using silhouette intersection (silhouettes were computed using image differencing, and morphological operations to remove noise). Bottom: robust silhouette intersection, where a voxel is considered occupied if 3 out of 4 cameras agree that it is within the silhouette.

Figure 4-5: Top: four of the 16 views captured. Bottom left: one view of our reconstructed volume. Bottom right: another view of the reconstructed volume.

Chapter 5

Running the System

5.1 Discussion

The real-time Capture Cube system constructs three to five models per second. An important part of developing a complete system is choosing and then integrating the components; our choice of algorithms in relation to requirements, performance and quality is discussed below. With our system we are able to create 3D movies, which allow the viewer to change the viewpoint while the 3D data for the scene updates in time. We present results for a model constructed from a single time instant, a movie generated in real time, and an offline 3D movie construction.

For real-time performance, we use the compression and reconstruction algorithm described in appendix A. We find silhouettes by thresholding difference images. To clean up the silhouette, we use a 5x5 median filter and a local dilation operator to fill in holes; the more sophisticated algorithm described in section 3.2 is not used because its current implementation does not run in real time. We pipe the reconstructed model to the standard input of the 3D viewer, Geomview (a 3D viewing program developed at the University of Minnesota; many of the figures in this thesis are stills from the Geomview viewer), as a series of rectangles constructed along each worldline. Geomview allows the user to interactively change the view as the 3D models are being updated.
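A sketch of the real-time cleanup path just described, using SciPy's image filters: threshold a difference image, median-filter it with a 5x5 kernel, and dilate to fill holes. The threshold of 20 is borrowed from the morphological baseline in section 3.2.4, and two dilation iterations stand in for the "local dilation" step; both are assumptions of this sketch rather than the system's exact settings.

```python
import numpy as np
from scipy import ndimage

def realtime_silhouette(frame, background, threshold=20):
    """Threshold a difference image, then clean it up with a 5x5 median
    filter and a small binary dilation (grayscale images assumed)."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    mask = diff > threshold                                        # raw silhouette
    mask = ndimage.median_filter(mask.astype(np.uint8), size=5).astype(bool)
    return ndimage.binary_dilation(mask, iterations=2)             # fill small holes
```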
As discussed in appendix A, the quality of the silhouettes will affect the speed of the Worldline Algorithm. For silhouettes with many holes, the RLE encoding scheme will not be very efficient, and thus more data will need to be transferred over the network. Noisy silhouettes will cause the Worldline Algorithm to become the main bottleneck of the system; they will also result in poor quality reconstructions. As you can see, for real-time performance there is a trade-off between spending too little and too much time preprocessing the silhouettes to clean them up.

The discretization of the 3D space will also affect the performance of the Capture Cube system. For more detailed models, you will want to discretize the space on a finer resolution grid. The contributions to each worldline need to be intersected by the parent process; if there are too many worldlines to process, the speed of the system will be limited by this computation. We use a 64x64 grid of worldlines which are one inch apart.

We have found that the biggest slowdown is the network communication. This is not a function of the algorithms or the hardware, but rather of the operating system software. (A Linux kernel guru has informed us that Red Hat Linux 6.0 deals much better with local network communication.) With a dedicated local network hub, and an upgrade of our operating system from Red Hat 5.2 to Red Hat 6.0, we estimate the Capture Cube system will be able to generate 10 models per second.

If real-time performance is not required, we can generate better models using the algorithms described in chapters 3 and 4. These algorithms are still relatively fast: 3D models can be constructed at a rate of 10-30 models per minute on a single computer.

5.2 Results

Figure 5-1 provides a reconstruction of the virtual world with accurate camera placement, to give the reader a sense of the Capture Cube space. Figure 5-2 shows several different views of a reconstruction: six camera views have been used to reconstruct this model, and four views are shown. Cones originating from two of the cameras are shown to demonstrate volume intersection. We have also created several higher quality models (figures 5-5, 5-7, 5-8) using images from the Capture Cube system at one time instant; the silhouettes for these models were hand segmented. The model in figure 5-8 was the first 3D model created by the system. For demonstration purposes we have created several movies of the object moving while the view is changing, and we show a series of these frames here. Figure 5-9 shows a sequence of images generated from a 3D movie; although these images were not generated in real time, they are comparable in quality to models from real-time runs. Figures 5-10, 5-11, 5-12 and 5-13 show sequences of images generated from 3D movies whose models were reconstructed offline. The silhouettes were constructed using the method in 3.2, and the models were constructed using the basic volume intersection algorithm.

Figure 5-1: The reconstructed virtual world with cameras and image planes. Note that the figure is not in the center of the reconstructed area.

Figure 5-2: Several views of a reconstruction. Cones originating from two of the cameras are shown to demonstrate volume intersection.

Figure 5-3: Original images used for the silhouettes in 5-4.
Figure 5-4: The 12 silhouettes used to reconstruct the model in 5-5.

Figure 5-5: Six views of a reconstructed model.

Figure 5-6: Silhouettes from these 12 images were used to reconstruct the model in 5-7.

Figure 5-7: Several views of a 3D reconstruction.

Figure 5-8: Our first 3D reconstruction, shown from two views. Only six cameras were used to generate this model. Notice, in the image on the right, that there are not enough cameras to carve away part of the volume in front of the torso region.

Figure 5-9: Images from a 3D "frisbee" movie. One of the frames captures the frisbee in mid air.

Figure 5-10: Images from a 3D movie of a person lifting a box (Part I).

Figure 5-11: Images from a 3D movie of a person lifting a box (Part II).

Figure 5-12: Images from a 3D movie of a person lifting a box (Part III).

Figure 5-13: Images from a 3D movie of a figure dancing.

Appendix A

Data transfer and Fast Reconstruction

A.0.1 Introduction

One of the main difficulties of a real-time system is the incredible amount of data that needs to be transferred over the network for processing. The bandwidth needed to transmit the complete uncompressed images to a central computer is not available on our network; in addition, the bandwidth requirement increases linearly with the number of cameras. Typically, we can only achieve a maximum of 2 megabytes per second of throughput over the network. A color image from one camera consists of 230,400 bytes. For sixteen cameras, one time instant consists of 3,686,400 bytes, and at 30 frames per second we would need 110,592,000 bytes per second of throughput over the network, which is roughly fifty times the maximum available. Clearly, we need to do preprocessing and compression to reduce the transfer load. In fact, we want to distribute the computation load as much as possible: even if 100 million bytes per second could be transferred, they could not be processed in real time on current hardware. In this investigation, we have simplified the problem a great deal by restricting ourselves to work with binary images.

In this section, we discuss a new method of compression which requires approximately 2000 bytes per image. At 30 frames per second with sixteen images this is only 960,000 bytes per second, well under the maximum throughput level. An additional advantage of this scheme is that the data does not need to be decompressed at the other end, saving additional computational resources. I am going to summarize the main result of (Ozier, 1999), which describes a novel scheme of compression that is tightly integrated with volume reconstruction. (When I came into the project in January 1999, Owen Ozier and my advisor, Professor Paul Viola, were still in the discussion stages of the compression algorithm's development. With Owen's help, I integrated this new approach into the Capture Cube system.) Simply put, the algorithm consists of compressing each image using run length encoding along a global set of lines.

A.0.2 Compression, Worldlines and Imagelines

Binary silhouettes of typical objects and human subjects are usually not very complex: they have smooth boundaries and few holes. For example, in figure A-1, a line is drawn across a human silhouette. Along this line, in order, there are 88 off pixels, 41 on pixels, 23 off pixels, 11 on pixels, and 47 off pixels. With a simple run length encoding (RLE) this data can be encoded with 5 bytes. (Run length encoding is a compression scheme that encodes data by counting the number of elements of the same type which occur in a row; the output of the algorithm is a count of the number of elements followed by the element type. For example, the binary sequence 000001111000 would be encoded as (5 0 4 1 3 0).) RLE compression will break down when there is noise in the binary image: the "runs" of constant values will be shorter and there will be more of them, requiring greater data storage. With this in mind, we can see the importance of accurate segmentations. Chapter 3 discusses methods for creating silhouettes which have denoising properties and emphasize neighborhood connectivity.
By choosing the direction of encoding in each image intelligently, we are able to merge the compressed data from all the images without first decompressing it. To do this, we map a set of vertical lines, defined in a global coordinate system, onto each image. We call the global "virtual" lines worldlines; the lines they map to on the image plane we call imagelines. Figure A-2 shows the correspondence between worldlines and imagelines. The worldlines are placed on a grid whose width and depth correspond to the dimensions of the Capture Cube; the height of the worldlines corresponds to the height of the space. The volume is reconstructed by determining which locations on each of the worldlines would be inside the physical volume.

Figure A-1: A well segmented silhouette with a line overlayed indicating the direction of run length encoding.

Figure A-2: (a) A set of worldlines mapping onto a virtual image plane; (b) a close-up of this camera and worldline system, where the right-most imageline matches up with one of the worldlines.

The images are RLE coded along the imagelines and the data is passed to a central computer for processing. A particular worldline will correspond to an imageline in each of the images. Before transmission, the data along each of the imagelines is compressed such that the transitions are labeled. By merging the set of transitions along each worldline, the volume is reconstructed. This method is a form of silhouette intersection as discussed in 1.2. It can be thought of as a hybrid scheme between voxel projection and volume intersection, where the volume is discretized along the X and Y dimensions and remains continuous along the vertical dimension.

A.0.3 Discussion

We call this algorithm the Worldline Algorithm, based on its use of a global set of lines that define an encoding direction. Figures A-3 and A-4 illustrate merging the contributions from each imageline to set the "on" portion of the worldline. Figure A-3 shows a worldline which intersects the middle of the figure. The three cameras which see the worldline will each believe the worldline intersects a different subset of the volume. For example, the camera in back thinks the worldline intersects most of the figure, but the camera on the left knows it does not intersect the legs. A camera will almost always think that the worldline intersects more of the object than it actually does. Merging the contributions from each of the imagelines that correspond to the worldline results in a more accurate description of where the worldline actually intersects the object. In fact, as the number of views is increased, the worldlines will appear to intersect the "visual hull" of the object. The resolution and size of the worldline grid affects the speed of the whole system: the system will run at 10 models a second with a grid of 64x64 worldlines, and at double the resolution the number of frames per second drops by a factor of 4.
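To make the encoding and merging steps concrete, here is a small Python sketch: run-length encode a binary line, convert the runs to "on" intervals, and intersect interval lists across cameras. The real system encodes transitions along each camera's imagelines and maps them into the common worldline parameterization before intersecting; that mapping is omitted here, so this is only an illustration of the idea.

```python
def rle_encode(bits):
    """Run-length encode a binary sequence as (count, value) pairs,
    e.g. 000001111000 -> [(5, 0), (4, 1), (3, 0)]."""
    runs = []
    for b in bits:
        if runs and runs[-1][1] == b:
            runs[-1][0] += 1
        else:
            runs.append([1, b])
    return [(count, value) for count, value in runs]

def on_intervals(runs):
    """Convert (count, value) runs to half-open [start, end) 'on' intervals."""
    intervals, pos = [], 0
    for count, value in runs:
        if value:
            intervals.append((pos, pos + count))
        pos += count
    return intervals

def intersect_intervals(a, b):
    """Intersect two sorted interval lists; merging the per-camera runs for one
    worldline amounts to folding this over all contributing imagelines."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

# The example from the footnote above.
print(rle_encode([0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]))
```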
Figure A-3: Top: three camera views, a reconstructed figure, and a worldline.

Figure A-4: Three runs corresponding to the left, middle and right camera views from figure A-3. The run on the right is the intersection of those three runs.

Appendix B

Related Work

There are currently several groups working on systems with similar goals. Seitz and Dyer present a voxel coloring scheme in (Seitz and Dyer, 1997) based on a clever ordering of voxels. They are able to choose a color value for each voxel with one pass over a discretized volume. The algorithm runs in O(n) time, linear in the number of voxels, and therefore runs relatively quickly. On the downside, the cameras need to be placed in an acceptable configuration to ensure that the voxel ordering scheme does not break down; for example, cameras pointing in the outward direction are an acceptable configuration. In addition, the algorithm relies on accurate camera calibration with respect to both geometry and color: pixels in different cameras which correspond to the same point on the object need to be the same color. Because the computation of the geometry is local, exact calibration is required.

Kanade et al. (Narayanan, Rander and Kanade, 1998) have set up a multi-camera system with close to 50 cameras. They use the multi-baseline stereo approach to calculate the geometry. Again, with accurate camera calibration, this method can result in accurate 3D models. The accuracy of the model will depend on the stereo algorithm used and will suffer from the problems typically associated with stereo; for example, stereo algorithms take a long time to run and have problems in regions where there is a discontinuity of depth. Kanade et al. have not yet focused on a real-time system: the current system spends many hours of post-processing to compute a sequence of 3D models. The multi-baseline approach requires access to only a subset of the image data, so the computation can potentially be done in parallel.

Leonard McMillan et al. (Buehler, Matusik and McMillan, 1999) have developed a real-time system which focuses on rendering and graphics. Their construction never explicitly computes a 3D model, but instead recomputes the best image for the current view at each moment. The visual result is very realistic and it runs in real time at about 20 frames/sec. There are two potential downsides to their approach. First, a powerful parallel computer connected directly to the data source is necessary to process all the input data. Second, they do not construct an actual 3-dimensional model; without one, the system is limited in the applications to which it can be applied. For example, it would be difficult to integrate these models into an artificial 3D world without an explicit representation of the geometry.

Appendix C

Radial Distortion

To improve the accuracy of the algorithm, I included a radial distortion term in the camera model. A quick visual inspection will reveal that lines in images are not straight when they should be; this is the result of radial distortion. Radial distortion can be modeled as

\delta_x = x_u - x_d = (x_d - x_0)(K_2 r_d^2 + K_4 r_d^4)
\delta_y = y_u - y_d = (y_d - y_0)(K_2 r_d^2 + K_4 r_d^4)   (C.1)

where the subscript u indicates an undistorted pixel, d a distorted pixel, (x_0, y_0) is the center of the radial distortion, and r_d is the distance of the distorted pixel from that center. Experimentally, and as discussed in (Stein, 1993), only the first term K_2 is necessary to compensate for radial distortion.
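A minimal sketch of applying this correction to pixel coordinates, under the convention of equation (C.1) as reconstructed above (the undistorted position is the distorted one pushed away from the distortion center by K_2 r^2 + K_4 r^4). The function name is mine; the example values K_2 = 4.1e-7 and an image-center distortion center follow the measurements reported below.

```python
import numpy as np

def undistort_points(pts_d, k2, center, k4=0.0):
    """Move distorted pixel coordinates to their undistorted positions
    using the radial model of equation (C.1). pts_d: Nx2 array."""
    pts_d = np.asarray(pts_d, dtype=float)
    offset = pts_d - np.asarray(center, dtype=float)
    r2 = np.sum(offset ** 2, axis=1, keepdims=True)   # r_d squared
    return pts_d + offset * (k2 * r2 + k4 * r2 ** 2)

# Example for a 640x480 Intel Create and Share camera.
print(undistort_points([[10.0, 10.0], [320.0, 240.0]],
                       k2=4.1e-7, center=(320.0, 240.0)))
```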
To find the radial distortion parameters, I used the method described in (Stein, 1993). First I snapped an image of an 8.5x11 sheet of black and white stripes (see figure C-1), and then manually selected points along the lines. Starting with an estimate for each of the radial distortion parameters (x_0, y_0, K_2, K_4), I find the line that best fits the updated points, which have been adjusted to account for radial distortion. The best line is found by setting up a least squares problem and solving it using SVD. If the fit improves after adjusting the parameters, I update the current set and continue to iterate.

Experimentally, the estimated center of radial distortion varies significantly, typically +-30 pixels, from the center of a 640x480 image. Using the same camera and imaging the same sheet of black and white stripes will result in different estimates for the center of radial distortion, and this term is also very sensitive to the manual placement of points along the line. Although 30 pixels seems significant, it corresponds to less than a quarter of a pixel for typical values of the K_2 term, which are on the order of 1.0e-7. I can therefore safely ignore the parameters that define the center of radial distortion and assume that x_0 and y_0 are at the center of the image. I can also ignore the K_4 term, which is on the order of 1.0e-13, because it does not make a significant contribution to the undistorted pixel location. For our Intel Create and Share cameras, the distortion is significant and is corrected with K_2 = 4.1e-7.

Figure C-1: Two images used to calculate the radial distortion correction. Notice the clear bowing effect in both images due to radial distortion.

References

Bouguet, J.-Y. and Perona, P. (1998). 3D photography on your desk. In ICCV 98 Proceedings.

Boykov, Y., Veksler, O., and Zabih, R. (1999). Fast approximate energy minimization via graph cuts. In CVPR 99, pages 377-383.

Buehler, C., Matusik, W., and McMillan, L. (1999). Creating and rendering image-based visual hulls. Technical report, Massachusetts Institute of Technology.

Cherkassky, B. and Goldberg, A. (1997). On implementing the push-relabel method for the maximum flow problem. Algorithmica, 19:390-410.

Cormen, T. H., Leiserson, C. E., and Rivest, R. L. (1991). Introduction to Algorithms. MIT Press, Cambridge, Mass.

DeBonet, J. S. and Viola, P. (1999). Roxels: Responsibility weighted 3D volume reconstruction. In Proceedings of ICCV.

Dyer, C. R. (1998). Image-based visualization from widely-separated views. In Proc. Image Understanding Workshop, pages 101-105.

Faugeras, O. (1993). Three-Dimensional Computer Vision: a Geometric Viewpoint. MIT Press.

Greig, D., Porteous, B., and Seheult, A. (1989). Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B, 51(2):271-279.

Grimson, W. E. L. and Viola, P. (1998). Immersive sporting events. Grant proposal to NTT, Massachusetts Institute of Technology.

Gortler, S. J., Grzeszczuk, R., Szeliski, R., and Cohen, M. F. (1996). The lumigraph. In SIGGRAPH '96.

Gu, X., Gortler, S., Hoppe, H., McMillan, L., Brown, B., and Stone, A. (1999). Silhouette mapping. Computer Science Technical Report TR-1-99, Harvard University.

Horn, B. K. P. (1986). Robot Vision. MIT Press, Cambridge, Mass.

Laurentini, A. (1994). The visual hull concept for silhouette-based image understanding. IEEE Trans. Pattern Anal. Machine Intell., 16:150-162.

Levoy, M. and Hanrahan, P. (1996). Light field rendering. In SIGGRAPH '96.

Lorensen, W. and Cline, H. (1987). Marching cubes: a high resolution 3D surface construction algorithm. In Computer Graphics SIGGRAPH '87, pages 163-170.
Matsui, K., Iwase, M., Agata, M., Tanaka, T., and Ohnishi, N. (1998). Soccer image sequence computed by a virtual camera. IEEE Trans. Pattern Anal. Machine Intell., 16:860-865.

McMillan, L. and Bishop, G. (1995). Plenoptic modeling: An image-based rendering system. In Computer Graphics Proceedings.

Mellor, J. (1995). Enhanced reality visualization in a surgical environment. Technical Report 1544, Massachusetts Institute of Technology.

Narayanan, P., Rander, P. W., and Kanade, T. (1998). Constructing virtual worlds using dense stereo. In ICCV 98 Proceedings, pages 3-10.

Ozier, O. (1999). Variable viewpoint reality: A prototype for realtime 3D reconstruction. Master's thesis, Massachusetts Institute of Technology.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C. Cambridge University Press, Cambridge, England.

Prock, A. C. and Dyer, C. R. (1998). Towards real-time voxel coloring. In Proc. Image Understanding Workshop, pages 315-321.

Rander, P. W., Narayanan, P. J., and Kanade, T. (1996). Recovery of dynamic scene structure from multiple image sequences. In 1996 Int'l Conf. on Multisensor Fusion and Integration for Intelligent Systems, pages 305-312, Washington, D.C.

Seitz, S. and Dyer, C. (1997). Photorealistic scene reconstruction by voxel coloring. In CVPR 97 Proceedings, pages 1067-1073.

Seitz, S. M. and Dyer, C. R. (1995). Physically-valid view synthesis by image interpolation. In Proc. Workshop on Representations of Visual Scenes, Cambridge, MA.

Seitz, S. M. and Dyer, C. R. (1996). View morphing. In SIGGRAPH 96.

Serra, J. (1982). Image Analysis and Mathematical Morphology. Academic Press.

Snow, D., Viola, P., and Zabih, R. (2000). Exact voxel occupancy with graph cuts. In CVPR 2000.

Stein, G. P. (1993). Internal camera calibration using rotation and geometric shapes. Master's Thesis AITR-1426, Massachusetts Institute of Technology.

Sullivan, S. and Ponce, J. (1996). Automatic model construction, pose estimation, and object recognition from photographs using triangular splines. Technical report, Beckman Institute, University of Illinois.

Sullivan, S. and Ponce, J. (1998). Automatic model construction and pose estimation from photographs using triangular splines. Transactions on Pattern Analysis and Machine Intelligence, 20(10):1091-1096.

Szeliski, R. (1993). Rapid octree construction from image sequences. Computer Vision, Graphics and Image Processing, 58(1):23-32.

Tsai, R. Y. (1987). A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, RA-3:323-344.

Wei, G.-Q. and Ma, S. D. (1994). Implicit and explicit camera calibration: Theory and experiments. IEEE Trans. Pattern Anal. Machine Intell., 16:469-480.

Zhang, Z. (1998). A flexible new technique for camera calibration. Technical Report MSR-TR-98-71, Microsoft Research.

Zheng, J. Y. (1994). Acquiring 3-D models from sequences of contours. IEEE Trans. Pattern Anal. Machine Intell., 16:163-178.