3D Stereo Reconstruction using iPhone Devices
Final Report
Submitted by: Ron Slossberg, Omer Shaked
Supervised by: Aaron Wetzler

Table of Contents
List of Figures
Abstract
List of Abbreviations
1. Introduction
   1.1 General Project Description
   1.2 Programming Environment
      1.2.1 OpenCV Libraries
      1.2.2 OpenGL Libraries
2. Theoretical Background
   2.1 Pinhole Camera Model
      2.1.1 The Pinhole Model Geometry
      2.1.2 Homogeneous Coordinates
   2.2 Epipolar Geometry
   2.3 Stereo Correspondence
   2.4 Reconstructed Scene
3. Basic System Functionalities
4. Software Implementation
   4.1 Model-View-Controller Design Pattern
   4.2 Top-Level View
   4.3 Main Menu
      4.3.1 User Interface
      4.3.2 Features
   4.4 Calibration View
      4.4.1 User Interface
      4.4.2 Calibration Process
   4.5 Reconstruction View
      4.5.1 User Interface
      4.5.2 Reconstruction Process
   4.6 Photo Album View
      4.6.1 User Interface
   4.7 Interactive Image View
      4.7.1 User Interface
   4.8 Settings View
      4.8.1 User Interface
   4.9 Main Implementation Issues
      4.9.1 Simultaneous Photo Capture
      4.9.2 Communication between Devices
5. Results
6. Future Directions
7. Summary
8. References

List of Figures
Figure 1 - Pinhole Model Geometry
Figure 2 - Epipolar Geometry
Figure 3 - Smoothness Constraint Paths in SGBM
Figure 4 - iPhone Devices Fixed to a Handle
Figure 5 - A Side View of a Reconstructed Scene
Figure 6 - The MVC Design Pattern Structure
Figure 7 - Storyboard View of the App
Figure 8 - Main Menu View
Figure 9 - Calibration View
Figure 10 - Calibration Process Flow Chart
Figure 11 - Reconstruction View
Figure 12 - Reconstruction Process Flow Chart
Figure 13 - Photo Album View
Figure 14 - Interactive Image View
Figure 15 - Settings View
Figure 16 - Possible Settings of the Two Devices
Figure 17 - Gray-Scale Depth Maps
Figure 18 - Depth Map Examples
Figure 19 - Different Phases of the Same Scene

Abstract
Stereo reconstruction is a common method for obtaining depth information about a scene from 2D images taken simultaneously by two cameras from different viewpoints. The process works by finding corresponding objects that appear in both images and examining their relative positions, based on prior knowledge of the internal parameters of each camera and of the relative positions of the two cameras. It relies on the same basic principle that enables our eyes to perceive depth. In this project we have implemented a completely mobile 3D stereo reconstruction system built from two iPhone 4S devices, using them both to capture the images and to perform all of the required computation. The iPhone app we have developed lets the user perform both the calibration process, which obtains the required intrinsic and extrinsic parameters, and the stereo reconstruction process itself. The computer vision algorithms we used for both calibration and reconstruction are all provided by the OpenCV C++ libraries.

List of Abbreviations
MVG - Multiple View Geometry
CV - Computer Vision
GL - Graphics Library
RMS - Root Mean Square
BM - Block Matching
SGBM - Semi-Global Block Matching
MVC - Model-View-Controller
UI - User Interface
RTT - Round Trip Time

1. Introduction
Computer vision is a field that requires heavy computational tasks. Traditionally it therefore required a powerful backbone computer that received the captured images as input and performed all of the complex computations. With the continuous improvement in the performance of modern processors in general, and of mobile device processors in particular, applications that used to require an extensive infrastructure of computing resources can now run on a small iPhone device.
As a demonstration of the increased capabilities that mobile devices now possess, we decided to build a mobile app that implements a complete, stand-alone stereo reconstruction platform from two iPhone 4S devices. The iPhone devices perform all the tasks required for generating depth images: they capture the images using their cameras, communicate with each other over the Bluetooth protocol to transfer data, and compute the reconstructed depth images by themselves using well-known, open-source computer vision algorithms.

1.1 General Project Description
This project was divided into two main stages. In the first stage we familiarized ourselves with computer vision, and in particular with stereo vision, by learning the basic theoretical background of the pinhole camera model and epipolar geometry, and by performing basic stereo calibration exercises in MATLAB using a standard camera calibration toolbox. In addition, we learned the basics of Objective-C and iOS programming. In the second stage we built our 3D stereo reconstruction iOS app. We started by implementing the stereo calibration process, then moved on to implementing the reconstruction process and finally, based on the OpenGL libraries, we added an interactive 3D display of the reconstructed scenes.

1.2 Programming Environment
An iOS app is written in the Objective-C language using Apple's Xcode SDK. Objective-C is an object-oriented extension of C, and thus supports all the basic functionality of C. In addition, we also used C++ to integrate the OpenCV and OpenGL libraries into our project, as both are implemented in C++. Mobile application programming follows the basic Model-View-Controller design pattern. This pattern separates the background computational model of the app from the interactive views that are displayed to the user. The model is in charge of preparing the data required for the app to function. The controller, also referred to as the view controller, is responsible for obtaining the needed data from the model and displaying the correct views on the screen. The view is the actual object that is presented on screen.

1.2.1 OpenCV Libraries
OpenCV is an open-source library implementing computer vision functions. It provides real-time computer vision infrastructure that makes it possible to implement relatively complicated systems which would require much more effort if they were built from scratch. In this project we used the C++ OpenCV code, alongside comprehensive guides and documentation, to perform all the required CV tasks: stereo calibration using a chessboard pattern to retrieve the intrinsic and extrinsic parameters, undistorting and rectifying the captured images based on the parameters computed during the calibration phase, and finally the stereo correspondence process of matching pixels between the two captured images and reconstructing the depth scene. In addition, we used an OpenCV interface that handles the communication between the app and the iPhone's camera. Regarding the use of the OpenCV libraries, our main goals were to successfully integrate the provided functions into our code, to enable reliable data flow between the different functions, and to persistently store the relevant data across multiple runs of the app.

1.2.2 OpenGL Libraries
OpenGL (Open Graphics Library) is a powerful cross-language, multi-platform API for rendering 2D and 3D computer graphics.
The iOS API includes a lightweight version of this library called OpenGL ES (for Embedded Systems). This library gave us the flexibility and performance we needed in order to render the 3D scene as a 3D surface (triangle mesh) onto the screen of the iPhone. The library taps the iPhone's hardware graphics acceleration for rendering and manipulating the scene.

2. Theoretical Background
2.1 Pinhole Camera Model
3D projection is any method of mapping three-dimensional points to a two-dimensional plane. Every imaging device, whether artificial (a camera) or natural (an eye), projects the 3D world onto a 2D image. The mathematical formulation of this projection is called the projection model. Projection models are essential mathematical tools when dealing with photographs and when displaying 3D images on a computer monitor. The simplest and most general projection model is the pinhole camera model. This model describes the mathematical relationship between the coordinates of a 3D world point and its 2D projection onto the image plane of a pinhole camera. A pinhole camera is a theoretical "ideal" imaging device in which the camera aperture is an infinitesimally small point and no lenses are used to focus light. In reality there are no ideal imaging devices. For instance, the model does not take into account many imaging artifacts, such as geometric distortions or the blurring of unfocused objects caused by lenses and finite-sized apertures. The model also does not take into consideration the fact that the image plane is made up of discrete light detectors (pixels). This means that the pinhole camera model can only be used as a first-order approximation of the mapping from a 3D scene to a 2D image. The validity of this approximation depends on the imaging system (camera) being used and even varies within the image itself (lens distortions are often more pronounced near the edges of an image). As mentioned earlier, some of the abovementioned artifacts become negligible when using a high quality camera, and others, like lens distortion, can be corrected by modeling the distortion effect and then reversing it through a coordinate transformation. This means that the pinhole camera model holds for most practical applications in computer vision and computer graphics.

2.1.1 The Pinhole Model Geometry
Figure 1 - Pinhole Model Geometry
The figure above depicts the pinhole model geometry: C is the camera's center of projection (the pinhole), f is the focal length, u, v are the coordinates in the image plane, x, y, z are the coordinates in the real world, I is the image plane, P is some point in the x, y, z system and p is its projection onto the image plane. The projection is given by the intersection of the line connecting the center of projection to the 3D point with the image plane. Given a point in 3D space, we wish to find its projection in the image plane. From similar triangles we can deduce that the point p in the image plane is given by
\[
\frac{u}{f} = \frac{x}{z} \;\Rightarrow\; u = f\,\frac{x}{z}, \qquad \frac{v}{f} = \frac{y}{z} \;\Rightarrow\; v = f\,\frac{y}{z}.
\]

2.1.2 Homogeneous Coordinates
To further simplify the notation we would like to write this projection as a linear operator, represented by a matrix operating on a vector. To achieve this we must switch to homogeneous coordinates. The image coordinates are represented by
\[
\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} u'/w \\ v'/w \\ w/w \end{pmatrix},
\]
while the real-world coordinates are represented by \((x, y, z, 1)^{\top}\). This allows us to use the following matrix equation:
\[
\begin{pmatrix} u' \\ v' \\ w \end{pmatrix} = \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}.
\]
To find the actual u, v in the image plane one must divide by the third coordinate w.
This matrix is called the camera matrix, and it represents the projective transform as a linear operator on the homogeneous coordinates. In reality the camera matrix has some more factors added to it, which account for imperfections in the real camera:
\[
C = \begin{pmatrix} f_x & \alpha & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix},
\]
where \(f_x, f_y\) represent the different focal lengths in the u and v directions, \(\alpha\) represents the skew of the imaging plane, and \(u_0, v_0\) represent the location of the principal point in the u, v plane.
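To make the notation concrete, the short C++ fragment below builds an intrinsic matrix of this form with OpenCV types and projects a single 3D point expressed in the camera frame. It is purely illustrative: the numeric values are invented, not the calibrated parameters of our devices, and the zero fourth column of C is omitted since it only removes the homogeneous 1 of the world point.

```cpp
#include <opencv2/core.hpp>
#include <cstdio>

int main() {
    // Example intrinsic parameters (made-up values, not our calibrated ones).
    double fx = 800.0, fy = 800.0, skew = 0.0, u0 = 320.0, v0 = 240.0;

    // The left 3x3 block of the camera matrix C.
    cv::Matx33d K(fx,  skew, u0,
                  0.0, fy,   v0,
                  0.0, 0.0,  1.0);

    // A 3D point expressed in the camera coordinate frame (x, y, z).
    cv::Vec3d P(0.1, -0.05, 2.0);

    // Homogeneous image coordinates (u', v', w) = K * (x, y, z).
    cv::Vec3d uvw = K * P;

    // Divide by the third coordinate to obtain pixel coordinates.
    double u = uvw[0] / uvw[2];   // = fx * x / z + u0 when skew = 0
    double v = uvw[1] / uvw[2];   // = fy * y / z + v0
    std::printf("u = %.1f, v = %.1f\n", u, v);   // prints u = 360.0, v = 220.0
    return 0;
}
```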
2.2 Epipolar Geometry
Epipolar geometry is the geometry of stereo vision. When two cameras view a 3D scene from two distinct positions, there are a number of geometric relations between the 3D points and their projections onto the 2D images that lead to constraints between the image points. These relations are derived under the assumption that the cameras can be approximated by the pinhole camera model.
Figure 2 - Epipolar Geometry
In stereo photography we use two cameras to capture (project) a 3D scene from two distinct locations. We assume these locations (relative to each other) are known in advance. Each camera has its own camera matrix which is used to create the projected image. Using the information in both images, combined with the constraints of epipolar geometry, we can find the distance in space of a point seen in the images, essentially recovering the depth of the scene. Since the centers of projection of the cameras are distinct, each center of projection projects onto a distinct point in the other camera's image plane. These two image points are denoted by e_L and e_R and are called epipolar points (epipoles for short). Both epipoles, in their respective image planes, and both centers of projection O_L and O_R lie on a single 3D line, as seen in the figure above. The line O_L X is seen by the left camera as a point because it is directly in line with that camera's center of projection. However, the right camera sees this line as a line in its image plane. This line, shown in red in the figure above, is called an epipolar line. Symmetrically, the line O_R X, seen by the right camera as a point, is seen as an epipolar line by the left camera. An epipolar line is a function of the 3D point X, i.e. there is a set of epipolar lines in both images if we allow X to vary over all 3D points. Since the 3D line O_L X passes through the center of projection O_L, the corresponding epipolar line in the right image must pass through the epipole e_R, and vice versa. This means that all epipolar lines in one image must intersect the epipole of that image. In fact, any line which intersects the epipole is an epipolar line, since it can be derived from some 3D point X. If the relative translation and rotation of the two cameras are known, the epipolar geometry leads to two important observations:
1) If the projection point X_L is known, then the epipolar line e_R X_R is known, and the point X projects into the right image onto a point X_R which must lie on this particular epipolar line. This means that for each point observed in one image, the same point must be observed in the other image on a known epipolar line. This provides an epipolar constraint which corresponding image points must satisfy, and it means that it is possible to test whether two points really correspond to the same 3D point.
2) If the points X_L and X_R are known, their projection lines are also known. If the two image points correspond to the same 3D point X, the projection lines must intersect precisely at X. This means that X can be calculated from the coordinates of the two image points, a process called triangulation.
The epipolar constraint in stereo images can be described using a vector equation. We denote by F the fundamental matrix. F is a 3x3 matrix which relates corresponding points in image pairs. Given a point x in one image, the product F x defines the epipolar line in the other image on which the corresponding point x' must lie. This gives rise to the constraint
\[
x'^{\top} F\, x = 0,
\]
which must hold for every pair of corresponding points x and x'.

2.3 Stereo Correspondence
As previously shown, two images of the same scene must comply with the epipolar constraint, which is mathematically formulated as \(x'^{\top} F x = 0\). We also saw that, given corresponding matched coordinates in the two images, we can triangulate the distance of the point from the cameras. From this we deduce that if we have a method to match each point in one image to a point in the second image, we will be able to triangulate the distance for every one of the matched points. The process of finding matches for one image in another is called image correspondence, and in our case, since we are performing stereo imaging, it is called stereo correspondence. There are many methods for performing stereo correspondence, each having its own advantages and drawbacks. We chose to use two algorithms offered by OpenCV which gave us good results and were fairly efficient:
Block Matching - This method is the most efficient of the methods available in OpenCV, but it did not give us very good results. The algorithm looks at the block surrounding a pixel and finds the best corresponding block along the corresponding epipolar line. Once the best correspondence is found, the disparity is recorded and can later be translated into distance. This is a local method which takes no account of global constraints such as image smoothness.
Semi-Global Block Matching - This method was noticeably less efficient than the BM algorithm; however, it produced much better results, which justified the extra complexity. This method imposes a set of smoothness constraints on the matching process. The constraints are added by introducing a penalty for large differences in disparity between neighboring pixels along lines in the picture which intersect at a given pixel (illustrated in Figure 3). The resulting disparity is an optimization of the best local correspondence with the lowest global smoothness cost.
Figure 3 - Smoothness Constraint Paths in SGBM
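To illustrate how the two matchers are configured in OpenCV's C++ API, the sketch below creates either a BM or an SGBM matcher and computes a disparity map from an already rectified grayscale pair. The parameter values are plausible defaults chosen for the example rather than the exact settings used in the app; the P1 and P2 arguments are the smoothness penalties that implement the constraint described above.

```cpp
#include <opencv2/calib3d.hpp>

// Compute a disparity map from a rectified grayscale pair.
// useSGBM selects between the fast local matcher and the semi-global one.
cv::Mat computeDisparity(const cv::Mat& rectLeft, const cv::Mat& rectRight, bool useSGBM) {
    const int numDisparities = 64;   // must be a multiple of 16
    const int blockSize = 9;         // matching window side length (odd)

    cv::Ptr<cv::StereoMatcher> matcher;
    if (useSGBM) {
        // P1/P2 penalize disparity changes between neighbouring pixels along the
        // aggregation paths; P2 > P1 discourages large jumps (smoothness constraint).
        matcher = cv::StereoSGBM::create(/*minDisparity=*/0, numDisparities, blockSize,
                                         /*P1=*/8 * blockSize * blockSize,
                                         /*P2=*/32 * blockSize * blockSize);
    } else {
        matcher = cv::StereoBM::create(numDisparities, blockSize);
    }

    cv::Mat disparity16;                       // fixed-point disparity, scaled by 16
    matcher->compute(rectLeft, rectRight, disparity16);

    cv::Mat disparity8;                        // rescale to 0..255 for display/storage;
    disparity16.convertTo(disparity8, CV_8U, 255.0 / (16.0 * numDisparities));
    return disparity8;                         // unmatched pixels end up black
}
```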
2.4 Reconstructed Scene
The output of the stereo correspondence algorithms is a disparity map. This is a gray-level image that encodes the disparity (the horizontal distance between matched points in the two images) as a number between 0 and 255. Knowledge of the disparity can be used to extract further information from stereo images. In our case, the disparity is used to carry out a depth/distance calculation. Disparity and distance from the cameras are inversely related: as the distance from the cameras increases, the disparity decreases. This is what allows for depth perception in stereo images. Using geometry and algebra, the points that appear in the 2D stereo images can be mapped to coordinates in 3D space. Consider recovering the position of a point P from its projections p_l and p_r. The projections satisfy
\[
x_l = f\,\frac{X_l}{Z_l} \;\Leftrightarrow\; X_l = \frac{x_l Z_l}{f},
\qquad
x_r = f\,\frac{X_r}{Z_r} \;\Leftrightarrow\; X_r = \frac{x_r Z_r}{f}.
\]
In general, the two cameras are related by the transformation
\[
P_r = R\,(P_l - T),
\]
where R and T are the rotation matrix and the translation vector, respectively, between the two cameras. For a rectified pair, using \(Z_r = Z_l = Z\) and \(X_r = X_l - T\) (with T the baseline between the cameras), we have
\[
\frac{x_l Z}{f} - \frac{x_r Z}{f} = T \;\Rightarrow\; Z = \frac{f\,T}{d}, \qquad d = x_l - x_r,
\]
where d is the disparity. Once the image is re-projected into real-world coordinates, we have for each pixel its original (x, y, z) position. Additionally, we have a 3D color coordinate (RGB) for each pixel as well. All this data is saved for later use in the 3D scene viewer. The viewer takes the 3D coordinates and the 3D color coordinate of each pixel and packs them into an OpenGL vertex array. The vertex array is then rendered as a triangle mesh to give a representation of the scene as a sort of 3D surface, which can be rotated by touching the screen.
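As a purely illustrative numeric check of the relation \(Z = fT/d\) (the numbers below are assumed, not measured on our rig):
\[
f = 1000\ \text{px}, \quad T = 0.1\ \text{m}, \quad d = 50\ \text{px} \;\Rightarrow\; Z = \frac{fT}{d} = \frac{1000 \cdot 0.1}{50} = 2\ \text{m},
\]
while halving the distance to \(Z = 1\) m doubles the disparity to \(d = fT/Z = 100\) px, which is exactly the inverse relation between disparity and depth described above.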
3. Basic System Functionalities
The implemented system has several basic functionalities that enable it to perform the complete stereo reconstruction process. In this section we give a short description of each stage in the process of generating a 3D reconstructed image. A detailed description of the implemented software itself is given in section 4.
First of all, after fixing the two iPhone devices to a dedicated handle, a Bluetooth connection is established between the two devices. From that point on, the user can control the entire app functionality from a single device (which automatically sends the relevant commands to the other device), and is thus spared from having to juggle between the two devices.
After that, we move on to performing the calibration process. Note that the calibration process does not need to be performed on every activation of the app, but only after changing the relative positions of the devices. If both devices remain fixed to the handle between several runs of the app, the calibration needs to be done just once at the beginning, and from then on the computed calibration parameters remain stored inside the app. Before running the calibration itself, we need to enter a few parameters describing the chessboard pattern we are going to use for calibration. Those parameters, which include the number of internal corners along both axes of the chessboard and the dimensions of each square in the chessboard, are entered on one device and automatically transferred to the other device as well. In addition, we need to set a parameter that indicates whether this device is the left or the right device relative to the other device.
Figure 4 - iPhone Devices Fixed to a Handle
After entering the parameters, we can start performing the calibration. The devices are held in front of the chessboard pattern, and the user captures several photos of the pattern (usually between 5 and 15). Before capturing each photo, the user must make sure that the pattern is seen by the cameras of both devices. This is easy to do, as each iPhone displays its camera's live view throughout the calibration process (the same view that appears when taking a regular photo with an iPhone). The user presses the capture button on only one of the devices, and the app is in charge of synchronizing the devices so that both capture photos at approximately the same time. After each photo is taken, the app briefly displays the captured image with the extracted chessboard corners highlighted, before returning to the calibration screen. When done capturing the required number of photos, the user presses the calibration button to start the calibration computation. The app uses several OpenCV algorithms to calculate all the intrinsic and extrinsic parameters required later for the reconstruction stage. A successful calibration should end with an RMS value of about 0.2 pixels.
Once the calibration stage is completed, we can move on to the main feature of the app and start capturing images that will go through the 3D reconstruction process. As the user enters the reconstruction screen, the app immediately calculates the remapping that will be applied to each pixel of the images to be captured, to enable undistortion and rectification of every captured image. This is done automatically, only once, when entering the reconstruction screen, thus saving time in the reconstruction process itself, which is performed after every capture of an image (multiple images can be captured over a single visit to the reconstruction screen). Before starting to capture images, the user can choose whether to use high or low reconstruction quality. If low is chosen, the app uses the Block Matching algorithm to compute the stereo correspondence between the images; if high is chosen, it uses the SGBM algorithm instead (for a detailed description of the algorithms and the stereo correspondence process, see section 2). Even though the BM algorithm is significantly faster, the results of the SGBM algorithm are of much higher quality, so the recommendation is to set the quality to high. If one would like to compare the results of the two algorithms, a recommended setting is to assign one device to high quality and the other to low quality.
Now the user can start capturing images. As in the calibration process, the capture button is pressed on only one device, and this device orders the other device to capture an image as well. After each capture, each device sends the image it captured to the other device. Then each device uses the previously calculated maps to undistort and rectify the images before sending them to a set of OpenCV functions. Those functions compute the disparity map of the captured scene, and when the computation is finished, the gray-scale depth image is displayed on screen for a short period.
After capturing a single image or a set of images, the user can go into the photo album to view the results of the reconstruction process. The main table shows a list of all the reconstructed images stored by the app. Each entry contains a small image of the scene's depth map, intended to help the user choose which image to view and to estimate the quality of the reconstruction. When choosing a specific entry in the table, the user gets a colored, interactive 3D point-cloud display of the scene, which combines the original images and the calculated disparity values. This display can be rotated to different viewing angles, allowing the user to observe the actual perception of depth obtained for this specific image.
Figure 5 - A Side View of a Reconstructed Scene
4. Software Implementation
A mobile app implementation is by its nature highly view-driven, which is why the most commonly used design pattern for such apps is the Model-View-Controller design pattern. Accordingly, in this chapter we describe the implemented software by presenting each of its views and elaborating on the functionality of each view. At the end of this section we also describe several implementation issues we had to face and the solutions we chose for dealing with them.

4.1 Model-View-Controller Design Pattern
The MVC design pattern, which is practically the standard way to develop a mobile app, assigns each object a role: model, view or controller. In addition, it determines the way these three types of objects communicate with each other.
Figure 6 - The MVC Design Pattern Structure
The model is the brain of the app. It performs all the background logic that allows the app to supply the needed functionality, but it has no effect on the way the app is displayed to the user on screen. The ideal model has no notion of the view that displays its data, and thus can supply data to multiple views in the app. The model is also in charge of holding the persistent state of the app: the controller and the view change frequently during the activation of an app, so the app's persistency is mainly maintained by the persistency of the data in the model.
The view object describes what the user sees on screen. It has no notion of the model which holds the data that it presents. The view has two main responsibilities: it knows how to draw itself when it is set to appear on screen, and it knows how to respond to various user actions.
The controller is the intermediary between the model and the view. The controller is in charge of passing data between the model and the view, thus telling the view how to display itself and telling the model which tasks to perform. In addition, the controller handles the lifecycles of the different views, by deciding which view will appear on screen at any given time.
As briefly explained in each object's description, the communication between the objects is also defined by this design pattern. The model and the view never speak directly with each other. The controller communicates directly with the model by calling its public methods, but the model cannot call the controller directly, since the controller is UI related and the model should be UI independent. If the model wishes to update the controller, an observer design pattern can be used. The controller also communicates with its view, using the outlet design pattern, which allows the controller to update the view that appears on screen. When the view wants to communicate back to the controller, mainly following a user interaction with the view, the target-action pattern is used. In that case, the action performed by the user is sent back to the controller, where it triggers a predefined response method.

4.2 Top-Level View
The top-level view of the app describes the interfaces between the different controllers of the app, each in charge of handling a specific view. For that reason, in iOS these controllers are called view controllers.
Figure 7 - Storyboard View of the App
The figure above is a screenshot of the app's storyboard. The storyboard is a mechanism first introduced in iOS 5 that supplies a graphical view of all the views in the app and the connections between all the different view controllers. The arrows between the view controllers represent the possible navigation paths from one view controller to another.
It can be seen that the Main Menu view controller (the second screen from the left) is the base view controller, from which the user can navigate to all the other view controllers, each representing a specific feature of the app. When moving between two features of the app, for example capturing images for reconstruction and then viewing those images, the user must go through the main menu, as there is no direct navigation connection between the two view controllers responsible for those tasks.
In the following sub-sections we give a detailed description of the app, describing the design of each view and the tasks its view controller is responsible for.

4.3 Main Menu
As explained briefly in the previous sub-section, the main menu is the base view of the app, from which the user navigates between the app's main functionalities.
Figure 8 - Main Menu View

4.3.1 User Interface
Connect Devices Button - launches the connection picker, which allows the user to create the Bluetooth session between the devices. A detailed description of this functionality appears below in section 4.3.2.
Calibration Button - navigates the app on both devices to the calibration view. It is enabled only after a connection between the devices has been created; if pressed without one, it will not navigate to the calibration screen and will display a pop-up message instead.
Reconstruction Button - navigates the app on both devices to the reconstruction view, in which the user captures the images that will be reconstructed. It is enabled only after a connection between the devices has been created; if pressed without one, it will not navigate to the reconstruction screen and will display a pop-up message instead.
Photo Library Button - navigates the app to the table view of 3D reconstructed images. If pressed after a connection between the devices was established, it navigates both devices to this screen.
Settings Button - navigates the app to the settings view. If pressed after a connection between the devices was established, it navigates both devices to this screen.

4.3.2 Features
Creating a Bluetooth session between the devices - using the Session Manager object and the GameKit framework, the app establishes a Bluetooth connection between the two iPhone devices. When the connect devices button is pressed, a dialogue box is launched in which all available iPhones appear. After making sure each iPhone appears in the availability list of the other device, the user taps the requested device on one of the devices, and the session manager object handles the creation of the Bluetooth session.
Calculating the message passing delay - after creating the session, the session manager object generates a sequence of messages sent between the devices, from which the devices calculate the initial average message passing delay. This value has great significance for the accuracy of the calibration and reconstruction processes, as it is used to minimize the time gap between the two images captured by the two devices whenever a capture order is given. The algorithm for calculating the delay, and the other options for achieving synchronization (which were not implemented), are discussed in detail in section 4.9.1. Only after calculating the delay is a message displayed on screen announcing that the connection was successfully established and showing the calculated delay value. Normally, the calculated delay value should be around 30 ms.
4.4 Calibration View
The calibration view enables the user to perform the entire calibration process of the stereo reconstruction system, which computes both the intrinsic and extrinsic parameters of the system. This process must be done at least once when first using the app, and must be done again whenever the relative locations of the devices change. The background of the view displays the live view of the iPhone's camera.
Figure 9 - Calibration View

4.4.1 User Interface
Capture Button - triggers an image capture action on both devices. Before pressing this button, the user needs to make sure that the calibration chessboard pattern is seen completely by the cameras of both devices. After pressing the button, the captured image is displayed with the pattern's corners highlighted (as shown in the figure above).
Calibrate Button - triggers the calculation of the intrinsic and extrinsic parameters of the stereo system, based on the captured images of the chessboard pattern. It should be pressed after 5 to 15 images of the pattern have been captured. This button needs to be pressed on each device individually in order to start the parameter computation. While performing the computation, the device displays a spinning wheel on screen, and when it finishes it displays a message reporting the RMS value the computation ended with.

4.4.2 Calibration Process
The figure below shows a detailed flow chart of the calibration process:
Figure 10 - Calibration Process Flow Chart
1) Initial State - the initial state, after the calibration view has appeared on screen, includes an active Bluetooth session between the devices, a valid message-passing delay value and an initialized camera (based on an OpenCV camera wrapper for iOS) displaying the live camera view on screen.
2) Capture - the capture button should be pressed on one of the devices, after making sure that both devices have acquired a full view of the chessboard pattern. The device whose capture button was pressed sends a capture indication to the other device, then waits for the pre-calculated delay period and captures an image. The other device receives the capture indication message and immediately captures an image.
3) Corners Extraction - both devices run a set of algorithms to detect the internal corners of the chessboard pattern and obtain the accurate pixel locations of those corners. When done, they send each other the coordinates of the extracted corners and display the captured image with the corners of the pattern highlighted (see the example in Figure 9).
4) Multiple Captures - stages 2 and 3 are performed several times. We recommend capturing 5 to 15 images from various angles, with the chessboard pattern covering a major part of the screen. Our best results were obtained when using a computer screen to display the chessboard pattern.
5) Calibrate - after capturing the required number of images, the calibrate button can be pressed on each of the devices. This triggers a set of algorithms that use the extracted corner coordinates from the images captured by both devices (the main OpenCV calls behind stages 3 and 5 are sketched below). First the intrinsic parameters of the device's camera are calculated, and afterwards the extrinsic parameters of the stereo system are calculated. The process completes by saving all the calculated parameters, to allow persistence between different launches of the app. At the end of the process a message is displayed announcing the RMS value that the process ended with.
To obtain a high quality reconstruction later on, the process should end with an RMS value of at most 0.4 pixels. If a higher value is obtained, the user is advised to perform the entire calibration process again (going back to step 1).
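The sketch below outlines the OpenCV calls behind stages 3 and 5. It is written against the OpenCV 3+ C++ API as an illustration, not a copy of the app's code (which used an earlier OpenCV release with a slightly different cv::stereoCalibrate argument order); the board size, flags and helper names are assumptions. Following the two-step flow of stage 5, the per-camera intrinsics K1, D1, K2, D2 are assumed to have been estimated beforehand with cv::calibrateCamera.

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

static const cv::Size kBoardSize(9, 6);       // inner corners per row / column (example values)
static const float    kSquareSizeMm = 25.0f;  // square size entered in the settings view

// Stage 3: locate the inner chessboard corners in one captured (grayscale) image.
bool extractCorners(const cv::Mat& gray, std::vector<cv::Point2f>& corners) {
    bool found = cv::findChessboardCorners(
        gray, kBoardSize, corners,
        cv::CALIB_CB_ADAPTIVE_THRESH | cv::CALIB_CB_NORMALIZE_IMAGE);
    if (found)   // refine to sub-pixel accuracy before the corners are exchanged between devices
        cv::cornerSubPix(gray, corners, cv::Size(11, 11), cv::Size(-1, -1),
                         cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.01));
    return found;
}

// Stage 5: estimate the extrinsics (R, T) given corner sets from both devices.
// K1, D1, K2, D2 are the per-camera intrinsics obtained earlier with cv::calibrateCamera.
double calibrateStereoRig(const std::vector<std::vector<cv::Point2f>>& leftCorners,
                          const std::vector<std::vector<cv::Point2f>>& rightCorners,
                          const cv::Size& imageSize,
                          cv::Mat& K1, cv::Mat& D1, cv::Mat& K2, cv::Mat& D2,
                          cv::Mat& R, cv::Mat& T, cv::Mat& E, cv::Mat& F) {
    // The chessboard's own 3D corner coordinates (z = 0 plane), repeated per image pair.
    std::vector<cv::Point3f> board;
    for (int i = 0; i < kBoardSize.height; ++i)
        for (int j = 0; j < kBoardSize.width; ++j)
            board.emplace_back(j * kSquareSizeMm, i * kSquareSizeMm, 0.0f);
    std::vector<std::vector<cv::Point3f>> objectPoints(leftCorners.size(), board);

    // Returns the RMS re-projection error that the app reports to the user.
    return cv::stereoCalibrate(objectPoints, leftCorners, rightCorners,
                               K1, D1, K2, D2, imageSize, R, T, E, F,
                               cv::CALIB_FIX_INTRINSIC,
                               cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 100, 1e-5));
}
```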
4.5 Reconstruction View
The reconstruction view presents the user with the main feature of the app: the capability to perform a 3D reconstruction of scenes captured by the stereo system. Multiple images can be captured during a single visit to this view, and it should be entered only after a successful calibration process has been completed (possibly in a previous launch of the app) and after a Bluetooth session between the devices has been established.
Figure 11 - Reconstruction View

4.5.1 User Interface
High / Low Switch - this switch enables the user to choose which stereo correspondence algorithm to use. If high is chosen, this device uses the higher quality SGBM algorithm; if low is chosen, it uses the lower quality but faster BM algorithm. An explanation of the algorithms can be found in section 2.
Capture Button - triggers the entire reconstruction process, which starts with both devices capturing an image and continues with calculating the reconstructed depth scene. When the process finishes, the device displays the gray-scale disparity map that was computed, before returning to the normal reconstruction view.

4.5.2 Reconstruction Process
The figure below shows a detailed flow chart of the reconstruction process:
Figure 12 - Reconstruction Process Flow Chart
1) Initial State - the initial state after the reconstruction view has appeared on screen. At this state there exists a valid Bluetooth session between the devices, the message-passing delay has already been calculated, the camera of the iPhone device has been initialized and a valid calibration process has been performed.
2) Loading Parameters - the devices load the pre-computed calibration parameters.
3) Remapped Pixel Map - each device generates an undistortion-rectification map instructing how to remap every pixel of the images that will be captured, before starting the stereo correspondence process. Each device computes two maps: one for remapping images it captures itself and one for images the other device captures and transfers to it. Calculating those maps once, when this view loads, saves time in the computation that is performed after every capture of an image (a sketch of the corresponding OpenCV calls appears after this list).
4) Choosing the Reconstruction Algorithm - the user chooses which stereo correspondence algorithm to use: the high quality / long processing time SGBM or the low quality / short processing time BM.
5) Capture - triggers the capturing of an image by both devices and the reconstruction of the selected scene. The capture button is pressed on only one of the devices, and this device sends a capture indication to the other device, then waits for the pre-calculated delay period and captures an image. The other device receives the capture indication message and immediately captures an image. In the next stage, both devices send the image they captured to the other device. After obtaining both images, each device creates the undistorted, rectified pair of images and performs the stereo correspondence stage according to the chosen algorithm, which results in a disparity map.
6) Saving the Depth Image - the depth image is displayed on screen and saved to the iPhone's file system.
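The following C++ sketch condenses stages 3 and 5-6 into the corresponding OpenCV calls. As before it is an illustration written against the OpenCV 3+ API rather than the app's actual code; the function and variable names are ours. The rectification maps are built once when the view loads and then applied to every captured pair before matching and re-projection.

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>

// Built once when the reconstruction view loads (stage 3).
cv::Mat map1L, map2L, map1R, map2R, Q;

void buildRectificationMaps(const cv::Mat& K1, const cv::Mat& D1,
                            const cv::Mat& K2, const cv::Mat& D2,
                            const cv::Mat& R, const cv::Mat& T,
                            const cv::Size& imageSize) {
    cv::Mat R1, R2, P1, P2;
    cv::stereoRectify(K1, D1, K2, D2, imageSize, R, T, R1, R2, P1, P2, Q);
    cv::initUndistortRectifyMap(K1, D1, R1, P1, imageSize, CV_16SC2, map1L, map2L);
    cv::initUndistortRectifyMap(K2, D2, R2, P2, imageSize, CV_16SC2, map1R, map2R);
}

// Run once per captured pair (stages 5-6); inputs are the grayscale captures.
void reconstructPair(const cv::Mat& rawLeft, const cv::Mat& rawRight, bool useSGBM,
                     cv::Mat& disparity8, cv::Mat& points3d) {
    cv::Mat rectLeft, rectRight;
    cv::remap(rawLeft,  rectLeft,  map1L, map2L, cv::INTER_LINEAR);   // undistort + rectify
    cv::remap(rawRight, rectRight, map1R, map2R, cv::INTER_LINEAR);

    // Stereo correspondence, using the same matchers as in the section 2.3 sketch.
    const int numDisp = 64, block = 9;
    cv::Ptr<cv::StereoMatcher> matcher;
    if (useSGBM)
        matcher = cv::StereoSGBM::create(0, numDisp, block, 8 * block * block, 32 * block * block);
    else
        matcher = cv::StereoBM::create(numDisp, block);

    cv::Mat disp16;
    matcher->compute(rectLeft, rectRight, disp16);                 // fixed-point, scaled by 16

    cv::Mat dispF;
    disp16.convertTo(dispF, CV_32F, 1.0 / 16.0);                   // disparity in pixels
    disp16.convertTo(disparity8, CV_8U, 255.0 / (16.0 * numDisp)); // gray-scale map shown on screen

    // Re-project every pixel into metric (x, y, z) for the 3D viewer.
    cv::reprojectImageTo3D(dispF, points3d, Q, /*handleMissingValues=*/true);
}
```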
4.6 Photo Album View
The photo album view displays a table containing all the reconstructed images. Each entry in the table contains a small depth map image, enabling the user to identify the different images. In addition, each image is given a serial number.
Figure 13 - Photo Album View

4.6.1 User Interface
Table Row Selection - selecting a row navigates to the interactive 3D display of that scene, rendered using OpenGL.
Table UI - basic iOS table view functionality is enabled, including scrolling through the table and deleting table entries (using the swipe gesture).

4.7 Interactive Image View
The interactive image view displays to the user a 3D scene in which every pixel is assigned its original color and its calculated depth. This view is rendered using the OpenGL libraries, which provide a framework for displaying complicated animated scenes such as this interactive 3D scene.
Figure 14 - Interactive Image View

4.7.1 User Interface
Interactive Image Interface - the interactive image interface enables the user to rotate the displayed scene in order to perceive the scene's depth from various angles. In addition, the user can zoom in or out of the image to adjust its size on screen.
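To give a flavour of how such a view can be drawn, the sketch below renders stored per-pixel positions and colors with OpenGL ES 1.1 client-side vertex arrays. This is an illustrative fixed-function fragment under our own assumptions about the data layout (separate position, color and index arrays); the app's actual renderer is not reproduced here.

```cpp
#include <OpenGLES/ES1/gl.h>   // iOS header for the OpenGL ES 1.1 fixed-function API
#include <vector>

// One entry per reconstructed pixel: 3 floats of position, 4 bytes of colour (RGBA).
struct SceneMesh {
    std::vector<GLfloat>  positions;  // x0, y0, z0, x1, y1, z1, ...
    std::vector<GLubyte>  colors;     // r0, g0, b0, a0, r1, ...
    std::vector<GLushort> indices;    // triangle indices connecting neighbouring pixels
};

// Called once per frame, after the model-view matrix has been set from the
// user's rotation / zoom gestures.
void drawScene(const SceneMesh& mesh) {
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);

    glVertexPointer(3, GL_FLOAT, 0, mesh.positions.data());
    glColorPointer(4, GL_UNSIGNED_BYTE, 0, mesh.colors.data());

    // Render the surface as a triangle mesh; GL_POINTS would give a point cloud instead.
    glDrawElements(GL_TRIANGLES, (GLsizei)mesh.indices.size(),
                   GL_UNSIGNED_SHORT, mesh.indices.data());

    glDisableClientState(GL_COLOR_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```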
4.8 Settings View
The settings view has two different functionalities. It displays a list of attributes that need to be set when starting to use the application. In addition, it displays several of the matrices that are computed during the calibration process.
Figure 15 - Settings View

4.8.1 User Interface
Relative Location Switch - with this switch the user chooses the location of each device in relation to the other device (located to the left or to the right of the other device). If the devices already have an active connection, flipping the switch on one device automatically sets the opposite value on the other device.
Chessboard Parameters Text Fields - there are four text fields, each with its own label, in which the user enters several characteristics of the chessboard pattern that will be used for calibration. In the first two fields, board width and board height, the user should enter the number of internal corners along the horizontal and vertical axes of the chessboard pattern. The next two fields, square width and square height, should contain the horizontal and vertical dimensions of each square in the chessboard pattern, entered in millimeters. If the devices already have an active connection while the values are entered, the values entered on one device are immediately set on the other device as well.

4.9 Main Implementation Issues
In the following sub-sections we introduce two implementation issues that we had to face during the implementation of this app. We cover in detail the solutions that were chosen, and add some arguments for and against each chosen solution.

4.9.1 Simultaneous Photo Capture
One of the subtle points in implementing a stereo reconstruction system is that any change in the absolute location of the devices (the position of the handle holding the devices) or in the captured scene itself, between the time the first device captures its image and the time the second device captures its image, can significantly affect the quality of the results. To reduce the influence of this issue, we must make the iPhones capture their images as simultaneously as possible. This problem is not trivially solved by using the internal time of each device, since the clocks of the two devices are not synchronized to begin with.
Problem: how to minimize the time gap between the capturing of images by the two devices when the capture button on one of them is pressed.
Suggested Solutions: several solutions can be used to solve this problem, and they can be divided into two groups. The first group tries to set the actual time of both devices to the same value, so that one of the devices can set a future time as the "capture time" and both devices capture their images at that exact moment. In this group, we present two means of obtaining clock synchronization:
o Web Service - the absolute time of the devices can be set using a dedicated web service that sends its timestamp to the client that addresses it. The disadvantage of this method is the variance in message delays across the internet, and therefore the time gap between the reply leaving the web service and its arrival back at the device, which can be significant.
o GPS - the GPS clock is very accurate and does not suffer from significant variance in message transition time. Its drawback is that the synchronization depends on the devices being located where they can receive the GPS signals.
The second group contains solutions based on the average RTT of messages between the devices. In this approach the initiating device waits a period equal to half the average RTT after sending the capture message to the other device, and then captures its image, while the receiving device captures its image immediately upon receiving the message from the initiating device.
Chosen Solution: we chose to implement an average RTT calculation algorithm, and to use it to set the delay between the moment a device sends the capture indication to the other device and the moment it captures an image itself. The algorithm includes two stages, performed identically by each of the devices (the averaging step is sketched in code at the end of this sub-section):
1) Initial Stage - immediately after establishing a new session, the devices send each other 10 messages. Each device measures the time between sending each message and receiving an answer from the other device. The average of those 10 measurements is then calculated, and measurements that deviate significantly from the average are discarded. The average is then recalculated from the remaining measurements, and this value is set as the initial RTT, from which RTT/2 is taken as the required delay period on each device.
2) Refresh Stage - every time the user enters the calibration or reconstruction views, the same process as in the initial stage is performed again. If the new value differs from the previous value by more than a pre-defined threshold, the new value is set as the RTT.
The advantages we see in this method are the short distance that messages have to traverse, which reduces the possible variance in message transition times, and the fact that it does not depend on any external service, which keeps the system completely stand-alone. The disadvantage is that the achieved accuracy, although reasonably high, falls short of what would be achieved using the GPS signals.
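The averaging and outlier-rejection arithmetic of the initial stage can be summarized by the short C++ sketch below. It only illustrates the computation; the actual app performs the measurements in Objective-C over the GameKit session, and the deviation threshold shown here is an example value.

```cpp
#include <cmath>
#include <vector>

// Given round-trip-time samples (in milliseconds), return the capture delay RTT/2.
// Samples that deviate strongly from the first-pass mean are discarded before
// the mean is computed a second time.
double captureDelayMs(const std::vector<double>& rttSamples, double maxDeviationMs = 15.0) {
    double sum = 0.0;
    for (double s : rttSamples) sum += s;
    const double firstMean = sum / rttSamples.size();

    double keptSum = 0.0;
    int keptCount = 0;
    for (double s : rttSamples) {
        if (std::fabs(s - firstMean) <= maxDeviationMs) {   // drop outliers
            keptSum += s;
            ++keptCount;
        }
    }
    const double rtt = (keptCount > 0) ? keptSum / keptCount : firstMean;
    return rtt / 2.0;   // the initiating device waits this long before capturing
}
```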
4.9.2 Communication between Devices
A key point in implementing the iPhone-based reconstruction system is enabling two-way communication between the two devices. This communication is used throughout the app, both for sending messages and for transferring data between the devices.
Problem: messages and data need to be passed between the two devices.
Suggested Solutions: two solutions are possible given the communication protocols supported by the devices. The first is using Wi-Fi communication and the other is using Bluetooth communication.
Chosen Solution: we chose to use Bluetooth for several reasons. The first is that we could use an existing, easy-to-use framework (GameKit) that saved us a lot of development time and effort. The second is that Bluetooth is a simpler protocol with less overhead, which decreases the time gap between the device receiving a message and the message being processed and handed to the running app. The main disadvantage we encountered with this solution is the much lower bandwidth that Bluetooth supports (up to hundreds of Kbps, compared with tens of Mbps over Wi-Fi). This bandwidth limitation comes into effect at the reconstruction stage, where the images captured by the devices need to be sent to each other.
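To illustrate why the Bluetooth bandwidth matters, the sketch below compresses a captured frame to JPEG with OpenCV and splits it into small packets of the kind a reliable session can deliver one by one. The packet size, JPEG quality and helper name are assumptions for illustration; the app's actual framing protocol is not shown.

```cpp
#include <opencv2/imgcodecs.hpp>
#include <algorithm>
#include <cstddef>
#include <vector>

// Encode a captured frame and cut it into fixed-size packets for transmission.
// A JPEG of some hundred kilobytes sent over a few hundred kbit/s Bluetooth link
// takes on the order of seconds, which dominates the cost of each reconstruction.
std::vector<std::vector<unsigned char>> packetizeFrame(const cv::Mat& frame,
                                                       std::size_t packetSize = 8 * 1024) {
    std::vector<unsigned char> jpeg;
    cv::imencode(".jpg", frame, jpeg, {cv::IMWRITE_JPEG_QUALITY, 80});

    std::vector<std::vector<unsigned char>> packets;
    for (std::size_t offset = 0; offset < jpeg.size(); offset += packetSize) {
        std::size_t len = std::min(packetSize, jpeg.size() - offset);
        packets.emplace_back(jpeg.begin() + offset, jpeg.begin() + offset + len);
    }
    return packets;   // the receiver reassembles the packets in order and decodes with cv::imdecode
}
```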
5. Results
In this section we briefly discuss the results obtained with our 3D reconstruction system and present several examples.
As for the calibration process, we managed to consistently achieve accurate calibrations, concluding with a typical RMS value of around 0.2 pixels. When the calibration process completes with an RMS of 0.4 pixels or more, we recommend repeating it, as the subsequent reconstruction results are likely to be unsatisfying. Another important fact to note regarding the calibration process is its sensitivity to the relative locations of the iPhone devices. In order to achieve a high quality calibration, the devices should be placed as close as possible to each other and at approximately the same height (as shown on the right side of the figure below). When the devices are placed far from each other (as shown on the left side of the figure) or at different heights, the quality of the calibration drops significantly.
Figure 16 - Possible Settings of the Two Devices
Regarding the reconstruction process, several images displaying the obtained results have already appeared throughout this report. Here we would like to present another set of examples and, in addition, demonstrate the differences in the resulting depth scenes when using each of the two stereo correspondence algorithms. Looking at the figure below, it is easy to observe the difference between the depth maps produced by the two algorithms. The figure contains depth maps of the same captured scene: the left depth map was obtained using the BM algorithm, while the right one was obtained using the SGBM algorithm. It can be seen that the SGBM algorithm generates a much smoother depth map, as it adds global smoothness constraints which are not taken into account by the BM algorithm. In addition, every black pixel in the depth maps marks a pixel that was not successfully matched by the stereo correspondence process, and the SGBM algorithm unsurprisingly manages to match more pixels than the BM algorithm.
Figure 17 - Gray-Scale Depth Maps
The next figure shows two depth maps of different scenes. The left one is of an animal model that was captured, and the right one is of a desk lamp.
Figure 18 - Depth Map Examples
In the next figure we can see the different phases of the same scene. In the leftmost image we see the original scene as it was captured by one of the devices. Next to it we see the depth map of this scene, and on the right we see two images displaying the reconstructed scene from two different angles.
Figure 19 - Different Phases of the Same Scene

6. Future Directions
Our project, in which a completely mobile, iPhone-based stereo reconstruction system was implemented, seems to be a great starting point for various possible future projects, which we divide into two groups.
The first group contains projects for improving the performance and functionality of the system we have created. Relying on our project as the basic infrastructure, there are many ways to try to improve its capabilities. The most promising seems to be creating a new stereo correspondence algorithm suited specifically to the characteristics of iPhone devices, thus achieving much greater efficiency. Another direction may be improving the communication scheme between the devices, to increase the bandwidth and shorten the duration of the reconstruction process. Finally, one may even think of ways to improve the calibration process, making it more accurate and allowing greater flexibility in the positioning of the two devices.
The second group of follow-up projects involves creating useful real-life applications based on the stereo reconstruction system. This mobile system not only re-creates a depth scene, but also allows us to obtain the real size of objects, which cannot be obtained from a simple 2D image. With this information, one may think of many applications, either in scanning specific areas or objects using the obtained depth scene, or in measuring the actual size of different objects.

7. Summary
This project has been a very challenging task. It required us to learn the basics of computer vision, which is a highly complex and deeply mathematical field. We were not asked to implement new computer vision algorithms, but rather to use existing algorithms implemented in the OpenCV libraries; however, in order to successfully implement the 3D reconstruction system while dealing with various implementation issues, we had to gain knowledge of the pinhole camera model, epipolar geometry, stereo calibration and stereo correspondence. In addition, we had to familiarize ourselves with iOS programming and the Objective-C programming language. This included not only learning the syntax of a new language, but also understanding the design models of mobile applications. The specific requirements of the implemented system did not allow us to limit ourselves to the mere basics of iOS app development, and demanded that we deal with more complex issues such as communication between devices, synchronization between devices, controlling a two-device app from one device and persistent storage of data.
The project was, to some extent, a proof of concept suggested by our supervisor, claiming that the ever-improving abilities of our mobile devices enable us to perform tasks that were traditionally regarded as requiring significant computing resources. Looking at the final result, we feel we did manage to prove this concept. The implemented system is completely mobile, relatively easy to use and performs scene reconstruction in a matter of seconds. No less important is the fact that it provides the user with live results, so it can be judged based on its real performance, and not only on a set of graphs and simulated results.
To conclude, we would like to thank our supervisor Aaron Wetzler and the entire staff of the Geometric Image Processing Lab for their extensive guidance and support throughout this project.

8. References
• Computer Vision course, Spring 2010, University of Illinois, http://www.cs.illinois.edu/~dhoiem/courses/vision_spring10/lectures/
• Developing Apps for iOS, Paul Hegarty, Stanford fall 2010 course on iTunes
• Multiple View Geometry in Computer Vision course, University of North Carolina, http://www.cs.unc.edu/~marc/mvg/slides.html
• Modeling the Pinhole Camera, course lecture, University of Central Florida
• Computer Vision tutorial, GIP lab
• Learning OpenCV, Gary Bradski and Adrian Kaehler
• Stereo Vision using the OpenCV library, Sebastian Droppelmann, Moos Hueting, Sander Latour and Martijn van der Veen