3D Stereo Reconstruction
using iPhone Devices
Final Report
Submitted by:
Ron Slossberg
Omer Shaked
Supervised by:
Aaron Wetzler
Table of Contents

List of Figures
Abstract
List of Abbreviations
1. Introduction
   1.1 General Project Description
   1.2 Programming Environment
      1.2.1 OpenCV Libraries
      1.2.2 OpenGL Libraries
2. Theoretical Background
   2.1 Pinhole Camera Model
      2.1.1 The Pinhole Model Geometry
      2.1.2 Homogeneous Coordinates
   2.2 Epipolar Geometry
   2.3 Stereo Correspondence
   2.4 Reconstructed Scene
3. Basic System Functionalities
4. Software Implementation
   4.1 Model-View-Controller Design Pattern
   4.2 Top-Level View
   4.3 Main Menu
      4.3.1 User Interface
      4.3.2 Features
   4.4 Calibration View
      4.4.1 User Interface
      4.4.2 Calibration Process
   4.5 Reconstruction View
      4.5.1 User Interface
      4.5.2 Reconstruction Process
   4.6 Photo Album View
      4.6.1 User Interface
   4.7 Interactive Image View
      4.7.1 User Interface
   4.8 Settings View
      4.8.1 User Interface
   4.9 Main Implementation Issues
      4.9.1 Simultaneous Photo Capture
      4.9.2 Communication between Devices
5. Results
6. Future Directions
7. Summary
8. References
List of Figures

Figure 1 - Pinhole Model Geometry
Figure 2 - Epipolar Geometry
Figure 3 - Smoothness Constraint Paths in SGBM
Figure 4 - iPhone Devices Fixed to a Handle
Figure 5 - A Side View of a Reconstructed Scene
Figure 6 - The MVC Design Pattern Structure
Figure 7 - Storyboard View of the App
Figure 8 - Main Menu View
Figure 9 - Calibration View
Figure 10 - Calibration Process Flow Chart
Figure 11 - Reconstruction View
Figure 12 - Reconstruction Process Flow Chart
Figure 13 - Photo Album View
Figure 14 - Interactive Image View
Figure 15 - Settings View
Figure 16 - Possible Settings of the Two Devices
Figure 17 - Gray-Scale Depth Maps
Figure 18 - Depth Map Examples
Figure 19 - Different Phases of the Same Scene
Abstract
Stereo Reconstruction is a common method for obtaining depth information about a given
scene using 2D images of the scene taken simultaneously by two cameras from different
views. This process is done by finding corresponding objects which appear in both images and
examining their relative positions in the images, based on previous knowledge of the internal
parameters of each camera and the relative positions of both cameras. This method relies on
the same basic principle that enables our eyes to perceive depth.
In this project we have implemented a completely mobile 3D stereo reconstruction system
built from two iPhone 4S devices, by using them both to capture the images and perform all of
the required computational processes. The iPhone app we have developed enables the user to perform
both the calibration process for obtaining the required intrinsic and extrinsic parameters and
the stereo reconstruction process itself.
The Computer Vision algorithms we used for both the calibration and reconstruction processes were all provided by the OpenCV C++ libraries.
List of Abbreviations
MVG - Multiple View Geometry
CV - Computer Vision
GL - Graphics Library
RMS - Root Mean Square
BM - Block Matching
SGBM - Semi-Global Block Matching
MVC - Model-View-Controller
UI - User Interface
RTT - Round Trip Time
1. Introduction
Computer Vision is a field that requires heavy computational tasks. This is why it has traditionally required the use of a powerful backbone computer, which received the captured images as input and performed all of those complex computations.
With the continuous improvement in the performance of modern processors in general, and of mobile device processors in particular, applications that used to require an extensive infrastructure of computing resources can now run on a small iPhone device.
As a demonstration of the increased capabilities that today's mobile devices possess, we decided to build a mobile app that implements a complete, stand-alone stereo reconstruction platform from two iPhone 4S devices.
The iPhone devices perform all the tasks required for generating depth images – they capture the images using their cameras, communicate with each other to transfer data over the Bluetooth protocol, and compute the reconstructed depth images themselves using well-known, open-source computer vision algorithms.
1.1 General Project Description
This project was divided into two main stages. In the first stage we familiarized ourselves with
computer vision and in particular with stereo vision, by learning the basic theoretical
background of the pinhole camera model and epipolar geometry, together with performing
basic stereo calibration exercises in MATLAB using a standard camera calibration toolbox. In
addition, we also learned the basics of Objective-C and iOS programming.
In the second stage, we built our 3D stereo reconstruction iOS app. We started by implementing the stereo calibration process, then moved on to implementing the reconstruction process and finally, based on the OpenGL libraries, added an interactive 3D display of the reconstructed scenes.
1.2 Programming Environment
Building an iOS app is done in the Objective-C language using Apple's Xcode SDK. Objective-C is an object-oriented extension of C, and thus supports all the basic functionality of C. In addition, we also used C++ to support the integration of the OpenCV and OpenGL libraries into our project, both of which are accessed through C/C++ interfaces.
Mobile application programming is done using the basic Model-View-Controller design
pattern. This pattern is used to separate the background computational model of the app from
the interactive views that are being displayed to the user. The model is in charge of preparing
the data required for the app to function. The controller, also referred to as the view controller, is
responsible for obtaining the needed data from the model and displaying the correct views on
the screen. The view is the actual object that is presented on screen.
1.2.1 OpenCV Libraries
OpenCV is an open source library implementing computer vision functions. It provides real-time computer vision infrastructure that enables the implementation of relatively complicated systems that would require much more effort if they were to be built from scratch. In this project we used
the C++ OpenCV code alongside comprehensive guides and documentation to perform all the
required CV tasks: stereo calibration using a chessboard pattern to retrieve the intrinsic and
extrinsic parameters, un-distorting and rectifying the captured images based on the parameters
that were computed during the calibration phase, and finally the stereo correspondence
process of matching pixels between the two captured images and reconstructing the depth
scene. In addition, we have used an OpenCV interface that handles the communication
between the app and the iPhone's camera.
Regarding the use of the OpenCV libraries, our main goals were to successfully integrate the library functions into our code, to enable reliable data flow between the different functions, and to provide the ability to persistently store the relevant data across multiple runs of the app.
1.2.2 OpenGL Libraries
The powerful graphics library OpenGL (Open Graphics Library) is a cross-language, multi-platform API for rendering 2D and 3D computer graphics. The iOS API includes a lightweight version of this library called OpenGL ES (for embedded systems). This library gave us the flexibility and performance we needed in order to render the 3D scene as a 3D surface (triangle mesh) on the screen of the iPhone. The library taps the power of the iPhone's hardware graphics acceleration for the task of rendering and manipulating the scene.
2. Theoretical Background
2.1 Pinhole Camera Model
3D projection is any method of mapping three-dimensional points to a two-dimensional plane.
Every imaging device whether artificial (camera) or natural (eye) projects the 3D world onto a
2D image. The mathematical formulation of this projection is called the projection model.
Projection models are essential mathematical tools when dealing with photographs and when
displaying 3D images on a computer monitor.
The simplest and most general projection model is the Pinhole Camera Model. This
model describes the mathematical relationship between the coordinates of a 3D world point
and its 2D projection onto the image plane of a pinhole camera. A pinhole camera is a
theoretical “ideal” imaging device where the camera aperture is described as an infinitesimally
small point and no lenses are used to focus light.
In reality there are no ideal imaging devices. For instance, the model does not take into
account many imaging artifacts such as geometric distortions or the blurring of unfocused objects
caused by lenses and finite sized apertures. The model also doesn’t take into consideration the
fact that the image plane is made up of discrete light detectors (pixels). This means that the
pinhole camera model can only be used as a first order approximation of the mapping from a
3D scene to a 2D image. The validity of this approximation depends on the imaging system
(camera) being used and even varies within the image itself (lens distortions are often more
pronounced near the edges of an image).
As mentioned earlier, some of the abovementioned artifacts become negligible when using a
high quality camera, and others like lens distortion can be corrected by modeling the
distortion effect and then reversing it through a coordinate transformation. This means that the
pinhole camera model holds for most practical applications in computer vision and computer
graphics.
2.1.1 The Pinhole Model Geometry
Figure 1 - Pinhole Model Geometry
The figure above depicts the pinhole model geometry:
• C is the camera center (the center of projection).
• f is the focal length.
• u, v are the coordinates in the image plane.
• x, y, z are the coordinates in the real world.
• I is the image plane.
• P is some point in the x, y, z system and p is its projection in the image plane. The projection is given by the intersection of the image plane with the line connecting the camera center to the 3D point.
When given a point in 3D space we wish to find its projection in the image plane. From
similar triangles we can deduce that the point p in the image plane is given by the formula
$$\frac{u}{f} = \frac{x}{z} \;\Rightarrow\; u = \frac{f \cdot x}{z}, \qquad
\frac{v}{f} = \frac{y}{z} \;\Rightarrow\; v = \frac{f \cdot y}{z},
\qquad \text{i.e.} \qquad
\begin{pmatrix} u \\ v \end{pmatrix} = \frac{f}{z}\begin{pmatrix} x \\ y \end{pmatrix}$$
2.1.2 Homogeneous Coordinates
To further simplify the notation we would like to write this as a linear operator, represented by a matrix operating on a vector. To achieve this we must switch to homogeneous coordinates.
The image coordinates are represented by
$$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \frac{1}{w}\begin{pmatrix} u' \\ v' \\ w \end{pmatrix},$$
while the real-world coordinates are represented by $(x, y, z, 1)^T$. This allows us to use the following matrix equation:
$$\begin{pmatrix} u' \\ v' \\ w \end{pmatrix} =
\begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}.$$
To find the actual $u, v$ in the image plane one must divide by the third homogeneous coordinate $w$. This matrix is called the camera matrix, and it represents the projective transform as a linear operator on the homogeneous coordinates. In reality the camera matrix contains a few additional factors which account for imperfections in the real camera:
$$C = \begin{pmatrix} f_x & \alpha & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
where $f_x, f_y$ represent the different focal lengths in the $u, v$ directions, $\alpha$ represents the skew of the imaging plane, and $u_0, v_0$ represent the location of the principal point in the $u, v$ plane.
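As a concrete illustration, here is a small, self-contained C++ sketch that applies the camera matrix to a 3D point and performs the division by the homogeneous coordinate. The intrinsic values and the point are made up for the example; they are not the parameters calibrated in this project.

```cpp
#include <cstdio>

int main()
{
    // Illustrative intrinsics (in pixels), not the project's calibrated values.
    const double fx = 800.0, fy = 800.0, skew = 0.0, u0 = 320.0, v0 = 240.0;

    // A 3D point in the camera coordinate frame (meters).
    const double X = 0.10, Y = -0.05, Z = 2.0;

    // Multiply by the 3x4 camera matrix in homogeneous coordinates:
    // (u', v', w)^T = C * (X, Y, Z, 1)^T
    const double up = fx * X + skew * Y + u0 * Z;
    const double vp = fy * Y + v0 * Z;
    const double w  = Z;

    // Divide by the third homogeneous coordinate to obtain pixel coordinates.
    std::printf("u = %.1f px, v = %.1f px\n", up / w, vp / w);  // 360.0, 220.0
    return 0;
}
```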
2.2 Epipolar Geometry
Epipolar geometry is the geometry of stereo vision. When two cameras view a 3D scene from
two distinct positions, there are a number of geometric relations between the 3D points and
their projections onto the 2D images that lead to constraints between the image points. These
relations are derived based on the assumption that the cameras can be approximated by
the pinhole camera model.
Figure 2 - Epipolar Geometry
In stereo photography we use two cameras to capture (project) a 3D scene from two distinct
locations. We assume these locations (relative to each other) are known in advance. Each
camera has its own camera matrix which is used to create the projected image. Using the
information in both images combined with the constraints of epipolar geometry we can find
the distance in space of a point in the image, essentially recreating the depth of the image.
Since the centers of projection of the cameras are distinct, each center of projection projects onto a distinct point in the other camera's image plane. These two image points are denoted by $e_L$ and $e_R$ and are called epipolar points (epipoles for short). Both epipoles in their respective image planes and both centers of projection $O_L$ and $O_R$ lie on a single 3D line, as seen in the figure above.
The line $O_L X$ is seen by the left camera as a point because it is directly in line with that camera's center of projection. However, the right camera sees this line as a line in its image plane. This line, shown in red in the figure above, is called an epipolar line. Symmetrically, the line $O_R X$, seen by the right camera as a point, is seen as an epipolar line by the left camera.
An epipolar line is a function of the 3D point $X$, i.e. there is a set of epipolar lines in both images if we allow $X$ to vary over all 3D points. Since the 3D line $O_L X$ passes through the center of projection $O_L$, the corresponding epipolar line in the right image must pass through the epipole $e_R$, and vice versa. This means that all epipolar lines in one image must intersect the epipolar point of that image. In fact, any line which intersects the epipolar point is an epipolar line, since it can be derived from some 3D point $X$.
If the relative translation and rotation of the two cameras is known, the corresponding epipolar geometry leads to two important observations:
1) If the projection point $x_L$ is known, then the epipolar line $e_R x_R$ is known, and the point $X$ projects into the right image onto a point $x_R$ which must lie on this particular epipolar line. This means that for each point observed in one image, the same point must be observed in the other image on a known epipolar line. This provides an epipolar constraint which corresponding image points must satisfy, and it means that it is possible to test whether two points really correspond to the same 3D point.
2) If the points $x_L$ and $x_R$ are known, their projection lines are also known. If the two image points correspond to the same 3D point $X$, the projection lines must intersect precisely at $X$. This means that $X$ can be calculated from the coordinates of the two image points, a process called triangulation.
The epipolar constraint in stereo images can be described using a vector equation. We denote by $F$ the fundamental matrix. $F$ is a $3 \times 3$ matrix which relates corresponding points in image pairs. Given a point $x$ in one image, $F x$ defines the epipolar line in the other image on which the corresponding point $x'$ must lie. This gives rise to the constraint $x'^T F x = 0$, which must hold for every pair of corresponding points $x$ and $x'$.
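Both the epipolar line and the constraint itself can be evaluated directly with OpenCV. The sketch below assumes a fundamental matrix is already available (in the app it would come out of the stereo calibration described in section 4.4); the numeric values and the two points are purely illustrative.

```cpp
#include <opencv2/calib3d/calib3d.hpp>
#include <opencv2/core/core.hpp>
#include <cstdio>
#include <vector>

int main()
{
    // Hypothetical fundamental matrix; not a calibrated result.
    cv::Mat F = (cv::Mat_<double>(3, 3) <<
                  0.0,     -1.2e-6,   3.1e-4,
                  1.5e-6,   0.0,     -2.8e-3,
                 -4.0e-4,   2.9e-3,   1.0);

    // A point in the left image and a candidate match in the right image.
    std::vector<cv::Point2f> left  = { cv::Point2f(420.f, 310.f) };
    std::vector<cv::Point2f> right = { cv::Point2f(395.f, 312.f) };

    // Epipolar line in the right image for the left point, returned as
    // coefficients (a, b, c) of the line a*u + b*v + c = 0.
    std::vector<cv::Vec3f> lines;
    cv::computeCorrespondEpilines(left, 1, F, lines);

    // The epipolar constraint x'^T * F * x: for a genuine correspondence this
    // value is close to zero; with the made-up numbers above it only
    // illustrates the computation.
    cv::Mat x  = (cv::Mat_<double>(3, 1) << left[0].x,  left[0].y,  1.0);
    cv::Mat xp = (cv::Mat_<double>(3, 1) << right[0].x, right[0].y, 1.0);
    double residual = cv::Mat(xp.t() * F * x).at<double>(0, 0);

    std::printf("line: %.6f u + %.6f v + %.6f = 0, residual = %g\n",
                lines[0][0], lines[0][1], lines[0][2], residual);
    return 0;
}
```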
2.3 Stereo Correspondence
As previously shown, two images of the same scene must comply with the epipolar constraint, which is mathematically formulated as $x'^T F x = 0$. Also, we saw that given the
corresponding matched coordinates in two images we can triangulate the distance of the point
from the camera. From this we deduce that if we have a method to match each point in one
image to a point in the second image, we would be able to triangulate the distance for every
one of the matched points.
The process of finding matches for one image in another is called image correspondence, and
in our case since we are performing stereo imaging it is called stereo correspondence. There
are many methods for performing stereo correspondence, each having its own advantages and
drawbacks. We chose to use two algorithms that were offered by OpenCV which gave us
good results and were fairly efficient:
• Block Matching – This method is the most efficient of the methods available in OpenCV, but did not give us very good results. The algorithm looks at the block surrounding a pixel and finds the best corresponding block along the corresponding epipolar line. Once the best correspondence is found, the disparity is recorded and can later be translated into distance. This is a local method which takes no account of global constraints such as image smoothness.
• Semi-Global Block Matching – This method was noticeably less efficient than the BM algorithm; however, it produced much better results, which justified the extra complexity. The method imposes a set of smoothness constraints on the matching process. The constraints are added by introducing a penalty for large differences in disparity between neighboring pixels along lines in the picture which intersect at a given pixel (see Figure 3). The resulting disparity optimizes the trade-off between the best local correspondence and the lowest global smoothness cost. A usage sketch of both matchers, based on the OpenCV interface, appears after Figure 3.
Figure 3 - Smoothness Constraint Paths in SGBM
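As referenced in the SGBM description above, this is a minimal sketch of how the two matchers are typically driven through the OpenCV C++ interface. It uses the OpenCV 3.x names and illustrative parameter values; these are not the settings used in the app.

```cpp
#include <opencv2/calib3d/calib3d.hpp>
#include <opencv2/imgproc/imgproc.hpp>

// Compute a displayable 8-bit disparity map from a rectified image pair.
cv::Mat computeDisparity(const cv::Mat& rectifiedLeft,
                         const cv::Mat& rectifiedRight,
                         bool highQuality)
{
    const int numDisparities = 96;          // must be a multiple of 16
    cv::Mat grayL, grayR, disparity;
    cv::cvtColor(rectifiedLeft,  grayL, cv::COLOR_BGR2GRAY);
    cv::cvtColor(rectifiedRight, grayR, cv::COLOR_BGR2GRAY);

    if (highQuality) {
        // Semi-global block matching: slower, but enforces smoothness.
        cv::Ptr<cv::StereoSGBM> sgbm =
            cv::StereoSGBM::create(/*minDisparity*/ 0, numDisparities,
                                   /*blockSize*/ 7,
                                   /*P1*/ 8 * 7 * 7, /*P2*/ 32 * 7 * 7);
        sgbm->compute(grayL, grayR, disparity);
    } else {
        // Plain block matching: fast, purely local.
        cv::Ptr<cv::StereoBM> bm =
            cv::StereoBM::create(numDisparities, /*blockSize*/ 21);
        bm->compute(grayL, grayR, disparity);
    }

    // Both matchers return fixed-point disparities (scaled by 16);
    // rescale to 8 bits for display as a gray-level map.
    cv::Mat disparity8;
    disparity.convertTo(disparity8, CV_8U, 255.0 / (numDisparities * 16.0));
    return disparity8;
}
```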
2.4 Reconstructed Scene
The output of the stereo correspondence algorithms is a disparity map. This is a gray level
image that encodes the disparity (horizontal distance between matched points in both images)
as a number between 0 and 255.
Knowledge of disparity can be used in further extraction of information from stereo images.
In our case, the disparity is used in order to carry out a depth/distance calculation. Disparity
and distance from the cameras are negatively correlated. As the distance from the cameras
increases, the disparity decreases. This allows for depth perception in stereo images. Using
geometry and algebra, the points that appear in the 2D stereo images can be mapped as
coordinates in 3D space.
Consider recovering the position of $P$ from its projections $p_l$ and $p_r$:
$$x_l = f\,\frac{X_l}{Z_l} \;\Rightarrow\; X_l = \frac{x_l Z_l}{f}, \qquad
x_r = f\,\frac{X_r}{Z_r} \;\Rightarrow\; X_r = \frac{x_r Z_r}{f}$$
In general, the two cameras are related by the transformation $P_r = R\,(P_l - T)$, where $R$ and $T$ are the rotation matrix and the translation vector, respectively, between the two cameras.
Using $Z_r = Z_l = Z$ and $X_r = X_l - T$ we have
$$\frac{x_l Z}{f} - \frac{x_r Z}{f} = T \;\Rightarrow\; Z = \frac{T \cdot f}{d}, \qquad d = x_l - x_r$$
where $d$ is the disparity.
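As a quick sanity check with made-up numbers (these are not the project's calibration values): with a focal length of $f = 700$ pixels and a baseline of $T = 60$ mm, a disparity of $d = 70$ pixels gives $Z = \frac{60 \cdot 700}{70} = 600$ mm, while a disparity of $d = 35$ pixels gives $Z = 1200$ mm; halving the disparity doubles the distance, which is exactly the negative correlation described above.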
Once the image is re-projected into real-world coordinates, we have for each pixel its original (x, y, z) position. Additionally, we have a 3D color coordinate (RGB) for each pixel as well. All this data is saved for later use in the 3D scene viewer. The viewer takes the 3D coordinates and the 3D color coordinate of each pixel and packs them into an OpenGL vertex array. The vertex array is then rendered as a triangle mesh to give a representation of the scene as a sort of 3D surface, which can be rotated by touching the screen.
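The re-projection step itself corresponds to a single OpenCV call. Below is a minimal sketch with illustrative names (not the app's actual code); Q is the 4x4 disparity-to-depth matrix that cv::stereoRectify produces from the calibration results.

```cpp
#include <opencv2/calib3d/calib3d.hpp>
#include <opencv2/core/core.hpp>

// Turn a disparity map into a per-pixel (x, y, z) image.  The disparity is
// assumed to already be in pixel units (for the fixed-point output of the
// block matchers, divide by 16 first).
cv::Mat reprojectScene(const cv::Mat& disparity, const cv::Mat& Q)
{
    cv::Mat xyz;   // CV_32FC3: one 3D point per pixel
    cv::reprojectImageTo3D(disparity, xyz, Q, /*handleMissingValues*/ true);
    return xyz;    // paired with the pixel colors when building the vertex array
}
```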
3. Basic System Functionalities
The system that was implemented has several basic functionalities that enable it to perform
the complete stereo reconstruction process. In this section, we will give a short description of
each stage in the process of generating a 3D reconstructed image. A detailed description of the
implemented software itself will be given in section 4.
First of all, after fixing the two iPhone devices to a dedicated handle, a Bluetooth connection
is established between the two devices. From that point on, the user can control the entire app
functionality from just a single device (which automatically sends the relevant commands to
the other device), and thus prevented from having to juggle between the two devices.
After that, we move on to performing the calibration process. It should be noted that the
calibration does not need to be performed before every activation of the app, but only after the relative positions of the devices have changed. If both devices remain fixed to a
handle between several runs of the app, the calibration should be done just once at the
beginning, and from there on the computed calibration parameters remain stored inside the
app.
Before running the calibration itself, we need to enter a few parameters describing the
chessboard pattern we are going to use for calibration. Those parameters, which include the
number of internal corners on both axes of the chessboard and the dimensions of each square
in the chessboard, are of course entered on one device and are automatically transferred to the other device as well. In addition, we need to set a parameter that notes whether this device is the left or the right device relative to the other device.
Figure 4 – iPhone Devices Fixed to a Handle
After entering the parameters, we can start performing the calibration. The devices are held in
front of the chessboard pattern, and the user will capture several photos of the pattern (usually
between 5 and 15 photos). Before capturing each photo, the user must make sure that the pattern is seen by the cameras of both devices; this is easy to verify, as each iPhone displays its camera's live view throughout the calibration process (the same view that appears when taking a regular photo with an iPhone device). The user presses the
capture button at only one of the devices, and the app is in charge of synchronizing the
devices to both capture photos at approximately the same time. After taking each photo, the
app will briefly display the captured image with the extracted chessboard corners highlighted, before returning to the calibration screen.
When done capturing the required amount of photos, the user presses the calibration button to
start the calibration process. The app will use several OpenCV algorithms to calculate all the
intrinsic and extrinsic parameters required later for the reconstruction stage. A successful
calibration should end with an RMS value of approximately 0.2 pixels.
Once the calibration stage is completed, we can move on to the main feature of the app and start capturing images that will go through the 3D reconstruction process. As the user enters the reconstruction screen, the app immediately calculates the remapping that will be applied to each pixel of the captured images in order to un-distort and rectify them. This is done automatically, only once, when entering the reconstruction screen, thus saving time in the reconstruction process itself, which is performed after every image capture (as multiple images can be captured over a single visit to the reconstruction screen).
Before starting to capture images, the user can choose whether to use a high or low quality of
reconstruction. If choosing low, the app will use a Block Matching algorithm to compute the
stereo correspondence between the images. If choosing high, the app will use the SGBM
algorithm instead (a detailed description of the algorithms and the stereo correspondence process is given in section 2). Even though the BM algorithm is significantly faster, the results of
the SGBM algorithm are of much higher quality, so the recommendation is to set the quality
to high. If one would like to compare the results from the two algorithms, a recommended
setting would be to assign one device to high quality and the other to low quality.
Now, the user can start capturing images. As with the calibration process, the capture button is
only pressed at one device, and this device orders the other device to capture an image as well.
After each capture, each device sends the image it captured to the other device. Then, each
device uses the previously calculated map to perform un-distortion and rectification of the
images, before sending them to a set of OpenCV functions. Those functions compute the
disparity map of the captured scene, and when computation is finished, the grayscale depth
image is displayed on screen for a short period.
After capturing a single image or a set of images, the user can go into the photo album to view
the results of the reconstruction process. The main table lists all of the reconstructed images stored by the app. Each entry contains a small image of the scene's depth map, intended to help the user choose which image to view and to estimate the quality of the reconstruction. When choosing a specific entry in the table, the user gets an interactive, colored 3D point-cloud display of the scene, which combines the original images and the calculated disparity values. The display can be rotated to different viewing angles, allowing the user to observe the actual depth perception obtained for this specific image.
Figure 5 - A Side View of a Reconstructed Scene
4. Software Implementation
A mobile app implementation is by its nature highly view-driven, which is why the most commonly used design pattern for such apps is the Model-View-Controller design pattern. Accordingly, in this chapter we describe the implemented software by presenting each of its views and elaborating on the functionality of each view. At the end of the section we also describe several implementation issues we had to face and the solutions we chose for dealing with them.
4.1 Model-View-Controller Design Pattern
The MVC design pattern, which is the standard way to structure an iOS app, assigns each object a role – model, view or controller. In addition, it also determines the way
these three types of objects communicate with each other.
Figure 6 - The MVC Design Pattern Structure
The model is actually the brain of the app. It performs all the required background logic that
allows the app to supply the needed functionality, but has no effect on the way the app is
displayed to the user on screen. The ideal model has no notion of the view that displays its data, and can thus supply data to multiple views in the app. The model is also in charge of holding the persistent state of the app: since the controller and the view frequently change during the life of the app, the app's persistence is mainly maintained by the persistence of the data in the model.
The view object describes what the user sees on screen. It has no notion of the model which
holds the data that it presents. The view has two main functionalities: it knows how to draw
itself when it is set to appear on screen, and it knows how to respond to various user actions.
The controller is basically the intermediary between the model and the view. The controller is
in charge of passing data between the model and the view, thus telling the view how to
display itself and telling the model which tasks to perform. In addition, the controller handles the lifecycles of the different views by deciding which view will appear on screen at any given time.
As explained in brief in each object's description, the communication between the objects is
also defined by this design pattern. The model and the view never speak directly with each
other. The controller communicates directly with the model by calling its public methods, but
the model can't call the controller directly, since the controller is UI related and the model
should be UI independent. If the model wishes to update the controller, an observer design
pattern can be used. The controller also communicates with its view, using the outlet design
pattern that allows the controller to update the view that appears on screen. When the view
wants to communicate back to the controller, mainly following a user interaction with the
view, the target-action pattern is used. In that case, the action performed by the user is sent back to the controller, where it triggers a predefined response method.
4.2 Top-Level View
The top-level view of the app describes the interfaces between the different controllers of the
app, each in charge of handling a specific view. For that reason, in iOS these controllers are called view controllers.
Figure 7 - Storyboard View of the App
The above figure is a screenshot of the app's storyboard. The storyboard is a mechanism that
was first introduced in iOS 5, and supplies a graphical view of all the views in the app and the
connections between all the different view controllers.
The different arrows between the view controllers represent the possibilities of navigating
from one view controller to the other. It can be seen that the Main Menu view controller (the
second screen to the left) is the basic view controller, from which the user can navigate to all
other view controllers, each representing a specific feature of the app. When moving between
two features of the app, for example capturing images for reconstruction and viewing those
images, the user must go through the main menu, as there is no direct navigation connection
between the two view controllers responsible for those tasks.
In the following sub-sections we will give a detailed description of the app, by describing the
design of each view and the tasks its view controller is responsible for.
4.3 Main Menu
As briefly explained in the previous sub-section, the main menu is the base view of the app,
from which the user navigates between each of the app's main functionalities.
Figure 8 - Main Menu View
4.3.1 User Interface
• Connect Devices Button – launches the connection picker, which allows the user to create the Bluetooth session between the devices. A detailed description of this functionality appears below in section 4.3.2.
• Calibration Button – navigates the app on both devices to the calibration view. It is enabled only after a connection between the devices has been created; if pressed before that, it will not navigate to the calibration screen and will display a pop-up message instead.
• Reconstruction Button – navigates the app on both devices to the reconstruction view, in which the user captures the images that will be reconstructed. It is enabled only after a connection between the devices has been created; if pressed before that, it will not navigate to the reconstruction screen and will display a pop-up message instead.
• Photo Library Button – navigates the app to the 3D reconstructed images table view. If pressed after a connection between the devices was established, it will navigate both devices to this screen.
• Settings Button – navigates the app to the settings view. If pressed after a connection between the devices was established, it will navigate both devices to this screen.
4.3.2 Features
• Creating a Bluetooth session between the devices – using the Session Manager object and the provided GameKit framework, the app establishes a Bluetooth connection between the two iPhone devices. When the connect devices button is pressed, a dialogue box is launched in which all available iPhones appear. After making sure each iPhone appears in the availability list of the other device, the user taps the requested device on one of the devices and the session manager object handles the creation of the Bluetooth session.
• Calculating the message-passing delay – after creating the session, the session manager object generates a sequence of messages sent between the devices, from which the devices calculate the initial average message-passing delay. This value is highly significant for the accuracy of the calibration and reconstruction processes, as it is used to minimize the time gap between the two images captured by the two devices whenever a capture order is given. The algorithm for calculating the delay, and the other options for achieving synchronization (which were not implemented), are discussed in detail in section 4.9.1. Only after the delay has been calculated, a message is displayed on screen announcing that the connection was successfully established and showing the calculated delay value. Normally, the calculated delay should be around 30 ms.
4.4 Calibration View
The calibration view enables the user to perform the entire calibration process of the stereo
reconstruction system, which computes both the intrinsic and extrinsic parameters of the
system. This process must be done at least once when first using the app, and then must be
done again whenever the relative locations of the devices are changed.
The background of the view displays the view of the iPhone's camera.
Figure 9 - Calibration View
4.4.1 User Interface
• Capture Button – triggers an image capture action by both devices. Before pressing this button, the user needs to make sure that the calibration chessboard pattern is seen completely by the cameras of both devices. After pressing the button, the captured image will be displayed with the pattern's corners highlighted (as appears in the figure above).
• Calibrate Button – triggers the calculation of the intrinsic and extrinsic parameters of the stereo system, based on the captured images of the chessboard pattern. It should be pressed after 5 to 15 images of the pattern have been captured. This button needs to be pressed on each device individually in order to start the parameter computation process. While performing the computation, the device displays a spinning wheel on screen, and when finished it displays a message reporting the RMS value the computation ended with.
4.4.2 Calibration Process
A detailed flow chart of the calibration process appears in the figure below:
Figure 10 - Calibration Process Flow Chart
1) Initial State – the initial state, after the calibration view has appeared on screen, will
include an active Bluetooth session between the devices, a valid message-passing delay
value and an initialized camera (based on an OpenCV camera wrapper for iOS)
displaying the live camera view on screen.
2) Capture – the capture button should be pressed on one of the devices, after making sure
that both devices have acquired a full view of the chessboard pattern. The device whose capture button was pressed sends a capture indication to the other device, then waits the pre-calculated delay period and captures an image. The other device receives the capture
indication message and immediately captures an image.
3) Corners Extraction – both devices run a set of algorithms to detect the internal corners of
the chessboard pattern and obtain the accurate pixel locations of those corners. When
done, they send each other the coordinates of the extracted corners and display the captured
image with the corners of the pattern highlighted (see example in figure 9).
4) Multiple captures – stages 2 and 3 are performed several times. We recommend capturing
5 to 15 images from various angles, with the chessboard pattern covering a major part of
the screen. Our best results were obtained when using a computer screen to display the
chessboard pattern.
5) Calibrate – after capturing the required number of images, the calibrate button can be pressed on each of the devices. This triggers a set of algorithms that use the extracted corner coordinates from the images captured by both devices. First, the intrinsic parameters of the device's camera are calculated, and afterwards the extrinsic parameters of the stereo system are calculated (a condensed OpenCV sketch of stages 3 and 5 appears after this list). The process completes by saving all the calculated parameters, to allow persistence between different launches of the app. At the end of the process a message is displayed announcing the RMS value that the process ended with. For obtaining a high quality reconstruction later on, the process should end with an RMS value of at most 0.4 pixels. If a higher value is received, the user is advised to perform the entire calibration process again (going back to step 1).
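As referenced in stage 5, the corner-extraction and calibration stages map onto a handful of OpenCV calls. The following condensed C++ sketch uses the OpenCV 3.x names; the refinement window, termination criteria and variable names are illustrative, not the app's actual code.

```cpp
#include <opencv2/calib3d/calib3d.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

// Stage 3: locate the internal chessboard corners and refine them to
// sub-pixel accuracy.
bool extractCorners(const cv::Mat& grayImage, cv::Size boardSize,
                    std::vector<cv::Point2f>& corners)
{
    bool found = cv::findChessboardCorners(grayImage, boardSize, corners);
    if (found)
        cv::cornerSubPix(grayImage, corners, cv::Size(11, 11), cv::Size(-1, -1),
                         cv::TermCriteria(cv::TermCriteria::EPS +
                                          cv::TermCriteria::COUNT, 30, 0.01));
    return found;
}

// Stage 5: given the corners extracted by both devices for every captured
// view, compute the extrinsic parameters of the stereo pair.  objectPoints
// holds the known 3D chessboard corner positions (on the z = 0 plane, in mm);
// each device first obtains its own intrinsics KL/DL or KR/DR with
// cv::calibrateCamera, and with the default CALIB_FIX_INTRINSIC flag
// cv::stereoCalibrate then estimates only R and T.
double calibrateStereoPair(const std::vector<std::vector<cv::Point3f> >& objectPoints,
                           const std::vector<std::vector<cv::Point2f> >& leftCorners,
                           const std::vector<std::vector<cv::Point2f> >& rightCorners,
                           cv::Mat& KL, cv::Mat& DL, cv::Mat& KR, cv::Mat& DR,
                           cv::Mat& R, cv::Mat& T, cv::Size imageSize)
{
    cv::Mat E, F;
    // Returns the RMS re-projection error; the report recommends at most 0.4 px.
    return cv::stereoCalibrate(objectPoints, leftCorners, rightCorners,
                               KL, DL, KR, DR, imageSize, R, T, E, F);
}
```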
4.5 Reconstruction View
The reconstruction view presents the user with the main feature of the app – the capability to perform a 3D reconstruction of scenes captured by the stereo system. Multiple images can be captured during a single visit to this view, and it should be entered only after a successful calibration process has been completed (possibly in a previous launch of the app) and a Bluetooth session between the devices has been established.
Figure 11 - Reconstruction View
4.5.1 User Interface
• High / Low Switch – this switch enables the user to choose which stereo correspondence algorithm to use. If choosing high, the device will use the higher quality SGBM algorithm, and if choosing low, it will use the lower quality but faster BM algorithm. An explanation of the algorithms can be found in section 2.
• Capture Button – triggers the entire reconstruction process, which starts with an image capture by both devices and continues with calculating the reconstructed depth scene. When the process finishes, the device displays the gray-scale disparity map that was computed, before returning to the normal reconstruction view.
4.5.2 Reconstruction Process
A detailed flow chart of the reconstruction process appears in the figure below:
Figure 12 - Reconstruction Process Flow Chart
1) Initial State – the initial state after the reconstruction view has appeared on screen. At this state, a valid Bluetooth session exists between the devices, the message-passing delay has already been calculated, the iPhone's camera has been initialized and a valid calibration has been performed.
2) Loading Parameters – the devices will load the pre-computed calibration parameters.
3) Remapped Pixel Map – each device generates an un-distortion and rectification map specifying how to remap every pixel of the images that will be captured, before starting the stereo correspondence process (see the sketch after this list). Each device computes two maps: one for remapping the images it captures and one for the images the other device captures and transfers to it. Calculating those maps once, when this view loads, saves time in the computation that is performed after every image capture.
4) Choosing Reconstruction Algorithm – the user chooses which stereo correspondence
algorithm to use: high quality / long processing time SGBM or low quality / short
processing time BM.
5) Capture – triggers the capturing of an image by both devices and the reconstruction of the
selected scene. The capture button is pressed only at one of the devices, and this device
sends a capture indication to the other device, then waits the pre-calculated delay period
and captures an image. The other device receives the capture indication message and
immediately captures an image. At the next stage, both devices send the image they
captured to the other device. After obtaining both images, each device creates the un-distorted, rectified pair of images and performs the stereo correspondence stage according
to the algorithm that was chosen, which results in a disparity map.
6) Saving the Depth Image – the depth image is displayed on screen and saved to the iPhone's file system.
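The remapping referenced in stage 3 and applied in stage 5 corresponds to the following OpenCV calls. This is a condensed sketch with illustrative names (OpenCV 3.x API), not the app's actual code.

```cpp
#include <opencv2/calib3d/calib3d.hpp>
#include <opencv2/imgproc/imgproc.hpp>

struct RectifyMaps { cv::Mat map1x, map1y, map2x, map2y, Q; };

// Stage 3: computed once when the reconstruction view loads.
RectifyMaps buildRectifyMaps(const cv::Mat& KL, const cv::Mat& DL,
                             const cv::Mat& KR, const cv::Mat& DR,
                             const cv::Mat& R, const cv::Mat& T,
                             cv::Size imageSize)
{
    cv::Mat R1, R2, P1, P2;
    RectifyMaps maps;
    cv::stereoRectify(KL, DL, KR, DR, imageSize, R, T, R1, R2, P1, P2, maps.Q);
    cv::initUndistortRectifyMap(KL, DL, R1, P1, imageSize, CV_32FC1,
                                maps.map1x, maps.map1y);
    cv::initUndistortRectifyMap(KR, DR, R2, P2, imageSize, CV_32FC1,
                                maps.map2x, maps.map2y);
    return maps;
}

// Stage 5: applied to every captured pair before stereo correspondence.
void rectifyPair(const RectifyMaps& maps,
                 const cv::Mat& rawLeft, const cv::Mat& rawRight,
                 cv::Mat& rectLeft, cv::Mat& rectRight)
{
    cv::remap(rawLeft,  rectLeft,  maps.map1x, maps.map1y, cv::INTER_LINEAR);
    cv::remap(rawRight, rectRight, maps.map2x, maps.map2y, cv::INTER_LINEAR);
}
```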
4.6 Photo Album View
The photo album view displays a table containing all the reconstructed images. Each entry in
the table contains a small depth map image enabling the user to identify the different images.
In addition, each image is given a serial number.
Figure 13 - Photo Album View
4.6.1 User Interface
• Table Row Selection – selecting a row navigates to the 3D interactive display of this scene using OpenGL.
• Table UI – basic iOS table view functionality is enabled, including scrolling through the table and deleting table entries (using the swipe gesture).
4.7 Interactive Image View
The interactive color image view presents the user with a 3D scene in which every pixel is assigned its original color and its calculated depth. This view is displayed using the OpenGL libraries, which provide a framework for displaying complicated animated scenes, such as this interactive
3D scene.
Figure 14 - Interactive Image View
4.7.1 User Interface
• Interactive Image Interface – the interactive image interface enables the user to rotate the displayed scene, providing a clear perception of the scene's depth from various angles. In addition, the user can zoom in or out of the image to adjust its size on screen.
4.8 Settings View
The settings view has two different functionalities. It displays a list of attributes that need to
be set when starting to use the application. In addition, it displays several of the matrices that
are computed during the calibration process.
Figure 15 - Settings View
4.8.1 User Interface
• Relative Location Switch – with this switch the user chooses the location of each device in relation to the other device (located to the left or to the right of the other device). If the two devices already have an active connection, pressing the switch on one device will automatically set the opposite value on the other device.
• Chessboard Parameters Text Fields – there are four text fields, each with a different label attached to it, in which the user enters several characteristics of the chessboard pattern that will be used for calibration. Into the first two fields, board width and board height, the user should enter the number of internal corners along the horizontal
and board height, the user should enter the number of internal corners along the horizontal
and vertical axes of the chessboard pattern. The next two fields, square width and square
height, should contain the horizontal and vertical dimensions of each square in the
chessboard pattern, entered in units of millimeters. If both devices already have an active
connection while the values of the text fields are entered, the values entered at one device
will be immediately set for the other device as well.
4.9 Main Implementation Issues
In the following sub-sections we introduce two implementation issues we had to face during the implementation of this app. We cover the chosen solutions in detail and give some arguments for and against each of them.
4.9.1 Simultaneous Photo Capture
One of the subtle points in implementing a stereo reconstruction system is that any change in the absolute location of the devices (the position of the handle holding them) or in the captured scene itself, between the moment the first device captures its image and the moment the second device captures its image, can significantly affect the quality of the results. To decrease the influence of this issue, we must ensure that the two iPhones capture their images as simultaneously as possible. The problem is not trivially solved by using the internal time of each device, since the clocks of the two devices are not synchronized to begin with.
• Problem: how to minimize the time gap between the capturing of images by the two devices when the capture button on one of them is pressed.
• Suggested Solutions: several solutions can be used to solve this problem, and they can be divided into two groups. The first group of solutions tries to set the actual time of both devices to the same value, so that one of the devices can set a future time as the "capture time" and both devices will capture their images at that exact moment. In this group, we present two means of obtaining clock synchronization:
o Web Service – the absolute time of the devices can be set using a dedicated web service that sends its timestamp to the client that addresses it. The disadvantage of this method is the variance in message delays across the internet, and therefore the time gap between the device's request reaching the web service and the reply arriving back at the device, which can be significant.
o GPS – the GPS clock is very accurate and does not suffer from significant variance in the transmission time of messages. Its drawback is that the synchronization depends on the devices being located where they can receive the GPS signals.
The second group contains solutions based on the average RTT of messages between the devices. In this method, the initiating device waits a period of half the average RTT after sending the capture message to the other device and then captures its image, while the receiving device captures its image immediately upon receiving the message from the initiating device.
• Chosen Solution: we chose to implement an average-RTT calculation algorithm and use it to set the delay period between the moment a device sends the capture indication to the other device and the moment it captures an image itself (a sketch of the averaging step appears at the end of this sub-section). The algorithm includes two stages, which are performed identically by each of the devices:
1) Initial Stage – immediately after establishing a new session, the devices send each other 10 messages. Each device measures the time between sending each message and receiving an answer from the other device. The average of these 10 measurements is then calculated, and measurements that deviate significantly from the average value are discarded. The average is then recalculated from the remaining measurements, and this value is set as the initial RTT, from which RTT/2 is derived as the required delay period on each device.
2) Refresh Stage – every time the user enters the calibration or reconstruction views, the same process as in the initial stage is performed again. If the new value differs from the previous value by more than a pre-defined threshold, the new value is set as the RTT.
The advantages we see in this method are the short distance that messages have to traverse, which reduces the possible variance in message transition times, and the fact that it does not depend on any external service, which keeps the system completely stand-alone. The disadvantage is that the achieved accuracy, although reasonably high, falls short of what would be achieved using the GPS signals.
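For reference, the averaging step of the chosen solution can be sketched in C++ as follows. The message exchange itself is done over the GameKit Bluetooth session and is not shown here, and the outlier threshold is an illustrative assumption, since the report does not state the exact value used.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Estimate the average RTT (in milliseconds) from ~10 round-trip samples,
// discarding measurements that deviate strongly from the first-pass mean.
double averageRtt(const std::vector<double>& samplesMs)
{
    if (samplesMs.empty())
        return 0.0;

    // First pass: plain average of all round-trip measurements.
    double mean = std::accumulate(samplesMs.begin(), samplesMs.end(), 0.0) /
                  samplesMs.size();

    // Second pass: keep only samples reasonably close to the first average.
    std::vector<double> kept;
    for (double s : samplesMs)
        if (std::fabs(s - mean) < 0.5 * mean)   // illustrative threshold
            kept.push_back(s);
    if (kept.empty())
        kept = samplesMs;                       // fall back to all samples

    // The refined average becomes the RTT estimate; the device that initiates
    // a capture waits RTT/2 before capturing its own image.
    return std::accumulate(kept.begin(), kept.end(), 0.0) / kept.size();
}
```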
4.9.2 Communication between Devices
A key point in implementing the iPhone-based reconstruction system is enabling two-way communication between the two devices. This communication is used throughout the app, both for sending messages and for transferring data between the devices.
• Problem: the need to pass messages and data between the two devices.
• Suggested Solutions: two solutions are possible when looking at the communication protocols supported by the devices: the first is Wi-Fi communication and the other is Bluetooth communication.
• Chosen Solution: we chose to use Bluetooth for several reasons. The first is that we could use an existing, easy-to-use framework (GameKit) that saved us a lot of development time and effort. The second is that Bluetooth is a simpler protocol with less overhead, which decreases the time from when the device receives a message until the message is processed and delivered to the running app. The main disadvantage we encountered with this solution is the much smaller bandwidth that Bluetooth supports (hundreds of Kbps, compared with tens of Mbps for Wi-Fi). This bandwidth limitation comes into effect at the reconstruction stage, where the images captured by the devices need to be exchanged.
5. Results
In this section we will shortly discuss the results obtained by using our 3D reconstruction
system, and present several examples of those results.
As for the calibration process, we managed to consistently achieve accurate calibration, concluding with a typical RMS value of around 0.2 pixels. When the calibration process completes with an RMS of 0.4 pixels or more, we recommend repeating the process, as the reconstruction results are then likely to be unsatisfactory.
Another important fact to note regarding the calibration process is its sensitivity to the relative
locations of the iPhone devices. In order to complete a high quality calibration, the devices
should be placed as close as possible, and approximately at the same height (as appears in the
right side of the figure below). When placing the devices far from each other (as appears on
the left side of the figure) or at different heights, the quality of calibration drops significantly.
Figure 16 - Possible Settings of the Two Devices
Regarding the reconstruction process, several images displaying the obtained results have already appeared throughout this report. Here, we would like to display another set of examples and, in addition, demonstrate the differences in the resulting depth scenes when using each of the two possible stereo correspondence algorithms.
Looking at the figure below, it is easy to observe the difference in the resulting depth maps produced by each algorithm. This figure contains depth maps of the same captured
scene. The left depth map is obtained by using the BM algorithm, while the right one is
obtained by using the SGBM algorithm. It can be seen that the SGBM algorithm generates a
much smoother depth map, as it adds global smoothness constraints which are not taken into
account by the BM algorithm. In addition, every black pixel in the depth maps signals a pixel
that wasn't successfully matched by the stereo correspondence process, and the SGBM
algorithm unsurprisingly manages to match more pixels than the BM algorithm.
Figure 17 - Gray-Scale Depth Maps
The next figure shows two depth maps of different scenes. The left one is of an animal model that was captured, and the right one is of a desk lamp.
Figure 18 - Depth Map Examples
In the next figure we can see the different phases of the same scene. In the leftmost image we see the original scene as it was captured by one of the devices. Next to it we see the depth map
of this scene, and to the right we see two images displaying the reconstructed scene from two
different angles.
Figure 19 - Different Phases of the Same Scene
6. Future Directions
Our project, in which a completely mobile, iPhone-based stereo reconstruction system was
implemented, seems to be a great starting point for various possible future projects, which we
divide into two groups.
The first group contains projects for improving the performance and functionality of the
system we have created. Relying on our project as the basic infrastructure, there are many possible ways to improve its capabilities. The most promising seems to be creating a new stereo correspondence algorithm suited specifically to the characteristics of iPhone devices, thus achieving much greater efficiency. Another direction may be improving
the communication scheme between the devices to increase the bandwidth and shorten the
duration of the reconstruction process. Finally, one may even think of finding ways for
improving the calibration process, making it more accurate and allowing greater flexibility in
the positioning of the two devices.
The second group of follow-up projects involves creating useful real-life applications based on the stereo reconstruction system. This mobile system not only re-creates a depth scene, but also yields the real size of objects, which cannot be obtained from a simple 2D image. With this information, one may think of many applications, whether in scanning specific areas or objects using the obtained depth scene or in measuring the actual size of different objects.
7. Summary
This project has been a very challenging task. It required us to learn the basics of computer
vision, which is a highly complex and deeply mathematical field. We were not asked to
implement new computer vision algorithms, but rather to use existing algorithms implemented in the OpenCV libraries; still, in order to successfully implement the 3D reconstruction system
while dealing with various implementation issues, we had to gain knowledge regarding the
pinhole camera model, epipolar geometry, stereo calibration and stereo correspondence.
In addition, we had to familiarize ourselves with iOS programming and the Objective-C
programming language. This included not only learning the syntax of a new language, but
also understanding the design models of mobile applications. The specific requirements of the
implemented system didn't allow us to limit ourselves to the mere basics of iOS app
development, and demanded that we deal with more complex issues such as communication
between devices, synchronization between devices, controlling a two-device app from a single device, and persistent storage of data.
The project was regarded as a proof of concept, suggested by our supervisor, based on the claim that the ever-improving capabilities of mobile devices enable us to perform tasks that were traditionally regarded as requiring significant computing resources.
When looking at the final result, we feel we did manage to prove this concept. The
implemented system is completely mobile, relatively easy-to-use and performs scene
reconstruction in a matter of seconds. No less important is the fact that it provides the user
with live results, so it can be judged based on its real performance, and not only based on a set
of graphs and simulated results.
To conclude, we would like to thank our supervisor Aaron Wetzler and the entire staff of the
Geometric Image Processing Lab for their extensive guidance and support throughout this
project.
8. References
• Computer Vision course, Spring 2010, University of Illinois
http://www.cs.illinois.edu/~dhoiem/courses/vision_spring10/lectures/
• Developing Apps for iOS, Paul Hegarty, Stanford fall 2010 course on iTunes
• Multiple View Geometry in Computer Vision course, University of North Carolina
http://www.cs.unc.edu/~marc/mvg/slides.html
• Modeling the Pinhole camera, course lecture, University of Central Florida
• Computer Vision tutorial, GIP lab
• Learning OpenCV, Gary Bradski & Adrian Kaehler
• Stereo Vision using the OpenCV library, Sebastian Droppelmann, Moos Hueting, Sander
Latour and Martijn van der Veen