A NEW VISION & INERTIAL POSE ESTIMATION SYSTEM FOR HANDHELD AUGMENTED REALITY
DESIGN, IMPLEMENTATION & TESTING
John M. Steinbis
Department of Engineering
Colorado School of Mines
March 19, 2008
Abstract
The research contained within this thesis may be summarized as the design, implementation, and testing of a new vision and inertial pose estimation system for registration
within handheld augmented reality (AR). Additionally, a real-time AR testbed was constructed from off-the-shelf components and free or open source software libraries. It runs at
frame rate, 30 Hz, and maintains highly accurate dynamic overlay. Significant contributions
were made in pose estimation and system architecture.
AR is a combination of computer-generated stimulus with the real world which aims
to augment human experience or perception. Most often this stimulus consists only of visual data, but audio and haptics may also be used. Three different 3D models
demonstrate AR capabilities within four application areas: a C-130 Hercules (training), a Kuka
manipulator (education and teleoperation support), and a Matlab peaks plot (complex data visualization).
Traffic cones are utilized as 3D fiducials within a potentially large work volume, whereas
planar patterns unnecessarily constrain user motion. The system is scalable because any size cone,
from 4” to 36”, can be used. The cones are also inexpensive, durable, stackable, and may be
quickly placed for outdoor use.
A quaternion-based EKF was implemented to filter sensor measurements and avoid
singularities found in other rotation parameterizations. A new object space pose uncertainty
algorithm with quaternion-normality constraint was designed and verified with a Monte Carlo
method. This uncertainty calculation is needed for systems that rely on a quaternion-based
estimator and object space algorithms such as the Orthogonal Iteration Algorithm (OIA).
Several new computer vision algorithms were created and tested in Matlab. The
algorithms are compared against popular existing algorithms such as the DLT and OIA. Results influenced the selection of a specific algorithm combination within our pose estimation
system.
A 6-DOF overlay test platform was created for measuring and analyzing overlay error
in the end-to-end system. Virtual and real features are segmented from a video stream to
measure alignment error. Students completed a “realism” questionnaire, with and without
the gyros enabled. Responses indicated the need for the IMU at times when fiducials are
not visible.
Contents

List of Figures

Chapter 1  Introduction
  1.1  Augmented Reality
  1.2  Registration
  1.3  Pose Estimation
  1.4  Thesis Contributions
  1.5  Thesis Overview

Chapter 2  Rigid Body Pose & Kinematics
  2.1  Basic Concepts
  2.2  Rotation Parameterization
    2.2.1  Euler Angles
    2.2.2  Angle-Axis
    2.2.3  Quaternions
  2.3  Kinematics
    2.3.1  Position
    2.3.2  Orientation
      2.3.2.1  Angular Velocity
      2.3.2.2  Euler-Angles
      2.3.2.3  Quaternion

Chapter 3  Sensors & Data Processing
  3.1  Camera
    3.1.1  Pinhole Camera Model
    3.1.2  Image Point Correction
    3.1.3  Traffic Cones as Fiducials
    3.1.4  Model Point Correction
    3.1.5  Cone Segmentation
      3.1.5.1  Color Classification
      3.1.5.2  Connected Components
      3.1.5.3  Blob Analysis
      3.1.5.4  Template Matching
    3.1.6  Correspondence
    3.1.7  Pose Estimation
  3.2  Inertial Measurement Unit
    3.2.1  Static Bias Correction
    3.2.2  Calibration of Relative Sensor Orientation

Chapter 4  Single Camera, Single Frame Pose
  4.1  Types of Algorithms
    4.1.1  Image Space
    4.1.2  Object Space
  4.2  Classic Algorithms
    4.2.1  Perspective-3-Point - Fischler & Bolles
    4.2.2  Direct Linear Transform - Abdel-Aziz & Karara
    4.2.3  “Absolute Orientation” - Horn
      4.2.3.1  Optimal Rotation - SVD
      4.2.3.2  Optimal Rotation - Quaternion
    4.2.4  Orthogonal Iteration Algorithm - Lu & Hager
  4.3  Experimental Algorithms
    4.3.1  Modified DLT + Gauss-Newton
      4.3.1.1  Linear Transformation
      4.3.1.2  Image Space System
      4.3.1.3  Object Space System
      4.3.1.4  Null Space Solution
      4.3.1.5  Final Optimization
    4.3.2  3-Simplex + Gauss-Newton
      4.3.2.1  3-Simplex and Barycentric Coordinates
      4.3.2.2  Measurement Matrices
      4.3.2.3  Null Space Solution
      4.3.2.4  Final Optimization
    4.3.3  3-Simplex SDP
      4.3.3.1  Distance Constraint
      4.3.3.2  “Linearization” + Gauss-Newton
      4.3.3.3  Dot Product Constraint
      4.3.3.4  Orthogonal Triad Constraints
      4.3.3.5  Polynomial Semidefinite Program

Chapter 5  Recursive Estimation
  5.1  Recursive Estimation of Markov Processes
  5.2  Gaussian & Nonparametric Filters
  5.3  Selection of Quaternion EKF
  5.4  Time Update
    5.4.1  State Vector
    5.4.2  State Prediction
    5.4.3  Uncertainty Prediction
  5.5  Measurement Update
    5.5.1  Camera Update
    5.5.2  Gyro Update
    5.5.3  Accelerometer Update
      5.5.3.1  Orientation
      5.5.3.2  Position and Orientation
    5.5.4  Quaternion Constraint
  5.6  Camera Uncertainty
    5.6.1  Derivation
    5.6.2  Image Space
    5.6.3  Object Space
    5.6.4  Quaternion Constraint

Chapter 6  Real-time System
  6.1  Hardware Description
  6.2  Software Description
  6.3  Measurement Buffer
  6.4  Vision Thread
    6.4.1  Image Space Error Residual Constraint
    6.4.2  Relative Change in Motion Constraint
  6.5  Graphics Generation

Chapter 7  Experiments
  7.1  Computer Vision Algorithms
    7.1.1  Overall Algorithm Comparison
    7.1.2  Best Algorithm Combinations
    7.1.3  Camera Uncertainty
  7.2  Overlay Test Platform
  7.3  Human Testing - Realism

Chapter 8  Summary and Conclusions

Bibliography

Chapter 9  Appendix
  9.1  DLT Parameters
  9.2  Simplex SDP
    9.2.1  Three Variable/Vector Case
    9.2.2  Four Variable/Vector Case
  9.3  Chi-Square Distribution
  9.4  Computer Vision Algorithm Comparison
List of Figures

3.1   Perspective Projection
3.2   Cone Geometry
3.3   Model Point Correction
3.4   Cone Segmentation
3.5   Cone Blob
3.6   RGB Color Space
3.7   HSI Color Space
3.8   8-Connectivity
3.9   Pyramid Decomposition
3.10  Cone with Ellipse
3.11  Cone Templates
3.12  IMU Coordinate System
4.1   Error Spaces
4.2   Perspective Tetrahedron
4.3   Congruent Triangle Alignment
5.1   Markov Process
5.2   EKF Prediction and Correction Cycle
5.3   Hyperplane Tangent to Hypersphere
6.1   Hardware Components & Sensors
6.2   Sensors & LCD
6.3   Libraries / Dependencies
6.4   Full System
6.5   Threads
6.6   Measurement Buffer
6.7   Measurement Node
6.8   Vision Thread
6.9   Graphics Generation
6.10  C-130 Hercules
6.11  Kuka KR6
6.12  Peaks Plot
7.1   Computer Vision Algorithm Test Groups
7.2   “Best Algorithm Combinations”
7.3   Camera Uncertainty, Image Space, Euler Angles
7.4   Camera Uncertainty, Image Space, Quaternion
7.5   Camera Uncertainty, Object Space, Euler Angles
7.6   Camera Uncertainty, Object Space, Quaternion
7.7   Overlay Test Platform
7.8   Overlay Accuracy, without Gyros (inches)
7.9   Overlay Accuracy, with Gyros (inches)
7.10  Realism Questionnaire
9.1   Group 1, 10 Points, Top: σuv = 5, Bottom: σuv = 10
9.2   Group 1, 25 Points, Top: σuv = 5, Bottom: σuv = 10
9.3   Group 2, 10 Points, Top: σuv = 5, Bottom: σuv = 10
9.4   Group 2, 25 Points, Top: σuv = 5, Bottom: σuv = 10
9.5   Group 3, 10 Points, Top: σuv = 5, Bottom: σuv = 10
9.6   Group 3, 25 Points, Top: σuv = 5, Bottom: σuv = 10
9.7   Group 4, 10 Points, Top: σuv = 5, Bottom: σuv = 10
9.8   Group 4, 25 Points, Top: σuv = 5, Bottom: σuv = 10
9.9   Group 5, 10 Points, Top: σuv = 25, Middle: σuv = 50, Bottom: σuv = 75
Chapter 1
Introduction
1.1
Augmented Reality
Augmented Reality (AR) is a combination of computer-generated stimulus with the real world which aims to augment human experience or perception. It is similar to Virtual Reality
(VR); however, in VR the user is completely immersed in a synthetic environment, whereas
in AR the user still has the ability to see and interact with the real world. Therefore,
AR supplements reality rather than completely replacing it. Ideally, the user perceives the
computer-generated stimulus as coexisting in the same environment. Most often this stimulus
consists only of visual data, but audio and haptics may also be used. The field is currently
receiving much attention from researchers in both academia and industry. Most of the recent
interest results from extremely powerful hardware at decreasing cost, which allows capable and
affordable systems to be constructed from off-the-shelf components. Azuma [4], [5] provides good
general background on augmented reality.
The primary motivation behind AR is that it can provide the user with information
that cannot be detected directly with the user's own senses. This concept may be applied
within an unlimited number of domains. Successful implementations have existed for a few
years now, [26], but significant improvements are still needed before the technology is widely
accepted. Complicated tasks, such as assembly, maintenance, and surgery, may be supported
by the insertion of additional information and visualization of hidden objects. Instructions
and maps may be presented to users during navigation within buildings, outdoors, or while
driving. Visualization of architecture, historic buildings, planned construction projects, and
other complex data sets is much more meaningful to a person if viewed three-dimensionally as
part of an actual scene. To avoid the high risk or cost of training, flight and driving simulators
are another attractive application area. In robotics, sensitive tasks, such as path planning,
may be supported by augmented teleoperation or telepresence. Finally, entertainment and
education may benefit from virtual objects in the classroom, museum exhibitions, theme
park attractions, and interactive games.
For visual augmentation, AR requires a device to present data to the user. Head-mounted displays (HMDs) are most common, but a monitor or LCD may also be used. Two
different types of display technology exist: optical and video see-through. Display options are
covered in more detail within chapter six.
1.2
Registration
The major challenge for AR systems is how to combine the real and virtual world into a
single environment while maintaining the illusion that virtual objects are indeed part of the
real world. Consistent registration is required or the illusion may be compromised. More
seriously, many applications demand a high degree of accuracy. Specifically, the relationship
of the user and objects within the world must be known for objects to be aligned correctly.
Typically, the relationship of objects to the world is static while the user’s relationship to the
world is dynamic. It is natural for the user to move to different locations and view objects
from different angles. Obviously, an AR system cannot control the motions of the user. The
user can look or move anywhere and the system must respond almost immediately.
Registration errors result from numerous sources and can be divided into two types,
static and dynamic. Static registration errors include both misalignment and jitter, or shaking. Misalignment errors are usually caused by poor calibration of sensors and optics. Jitter
is caused by noise in highly sensitive sensors. With dynamic error, virtual objects appear to
swim and lag thereby becoming disconnected from real objects. The primary causes of this
error are system delays, or lags, and include sources such as sensor processing and graphics
generation. The total (end-to-end) system delay can be measured as the time from when the
tracking system takes a measurement to when computer generated data is presented to the
user. System delays cause registration errors only when motion occurs. Tracking accuracy
is another cause of dynamic error. For most systems it is more difficult to accurately determine pose when fast motion occurs. Many perceptual issues arise as a result of registration
error. The mind compensates for small misalignment, but at some point, the system begins
to degrade. In general, error during quick user motion is less detectable. A threshold of
tolerable registration error has yet to be well established. Refer to [34], [28], [17], and [10]
for a more detailed discussion.
1.3
Pose Estimation
The combination of position and orientation is referred to as pose. Real-time pose estimation and graphics generation are a necessity to create a realistic augmented environment.
However, pose estimation is in no way limited to AR applications; navigation and robotics
are two classic examples of fields that also depend on accurate pose estimation.
One of the oldest forms of navigation relies on the sun and other stars. This was especially important for sailors in the middle of the ocean, where no landmarks exist. Celestial
navigation is now used on spacecraft by an optical system referred to as a “star tracker”.
On the other hand, for a robot to successfully manipulate or avoid collision with an object,
it must know the relationship of the object with respect to itself and the world.
A variety of sensors may be used for pose estimation. Cameras are a passive optical
sensor because they only receive light. Active optical sensors, such as LIDAR, emit light.
Other sensors include acoustic, magnetic, inertial, GPS, and wheel encoders. Of course,
they all have strengths and weaknesses, so for this reason multiple sensors are often used to
complement each other.
Computer vision techniques applied to camera images are probably the most popular
method of close range pose estimation. Given that the sensor is calibrated, the mapping from
3D points in the scene to 2D points in the image is well known. Therefore, the projected
image of an object (of known geometry) is a function of the object’s pose only. Once a set of
control points on the object (typically corners or other feature points) have been identified, it
is then possible to solve for the pose from a set of equations which relate the 3D coordinates
with their 2D image coordinates. Naturally occurring features may be used as control points,
but they are typically difficult to reliably detect. Unnatural features are often intentionally
introduced because they are easier to detect or unique natural features simply do not exist.
A few examples include planar patterns, concentric dots, and LEDs.
1.4
Thesis Contributions
The research project initially started as part of a military training program for C-130 loadmaster students. The military's goals were to save money by reducing the time spent in
an actual aircraft and to simulate dangerous scenarios without risk to trainee
health or equipment. Original design concepts of an outdoor tracking system were
handed over to the Air Force, and since then the project has grown into something much
different.
An actual AR system was constructed at the Colorado School of Mines specifically
for this research project. It was created for algorithm development and to facilitate future
research in the field. The system runs at frame rate, 30 Hz, and maintains highly accurate dynamic overlay. Significant contributions were made in pose estimation and system
architecture. Specifically,
• Development of a real-time augmented reality system utilizing 3D cone fiducials.
• Development of new pose estimation algorithms, and analysis against the state of the
art.
• Development of a new quaternion-pose uncertainty calculation for object space computer vision algorithms, and verification with a Monte Carlo method.
• Experimental analysis of static and dynamic registration performance with different
tracking strategies, including both automated numeric and human testing results.
1.5
Thesis Overview
The research contained within this thesis may be summarized as the design, implementation,
and testing of a new vision and inertial pose estimation system for registration with a low
cost handheld augmented reality device.
Background information on rigid body pose, rotation parameterization, and kinematics is covered in chapter two. This information is essential for understanding the goal of a
pose estimation system and serves as a foundation for advanced concepts in later chapters.
Chapters 3-6 cover the design and implementation of the system. These chapters are very
math intensive and attempt to fully explain how a camera and inertial sensor are used to
estimate the 6-DOF pose of an object for the purpose of registration.
In chapter seven, various tests of the system are presented. Matlab is used to test
several computer vision algorithms to see how well they perform under various conditions,
such as number of feature points and amount of noise. A new computer vision algorithm
is tested against existing algorithms as a comparison. Monte Carlo simulations are used
to verify a new object space uncertainty calculation. An overlay test is used to determine
the quality of overlay in the end-to-end system. Finally, results of an augmented reality
questionnaire are provided. The questionnaire was completed by students after being given
the opportunity to use the system firsthand with several different virtual objects.
Chapter 2
Rigid Body Pose & Kinematics
This chapter serves as a foundation for much of the thesis. First, rigid body pose, position
and orientation, is covered extensively and several rotation parameterizations are discussed.
Then, rigid body kinematics is covered, specifically angular velocity. Some of the material
might be a review for those who are already familiar with spatial transformations, but it still
serves as a convenient reference for advanced concepts developed in later chapters. Many
basic concepts come from Craig [8] and advanced concepts, such as quaternions, come from
various other sources including [11], [9], [6], [33], [30].
2.1
Basic Concepts
A rigid body is an idealization of a solid body of finite size in which deformation is neglected.
In other words, the distance between any two points on a rigid body remains constant in
time. The pose of a rigid body can be described by a combination of translation and rotation
from a given reference frame. For this purpose, a reference frame is selected to be rigidly
attached to the body itself. This is typically referred to as the “object” reference frame {O}.
The position of its origin and the orientation of its axes with respect to a given “global” or
“world” reference frame {W} represent the pose of the body.
A reference frame is defined as a right-handed orthogonal triad of unit vectors, X̂, Ŷ ,
Ẑ, that share a common point of intersection at the origin. The triad forms a basis for the
coordinate system. The orientation of one coordinate system, {B}, with respect to another,
{A}, may be described by the direction cosine matrix (DCM),

{}^{A}_{B}R =
\begin{bmatrix}
\hat{X}_B \cdot \hat{X}_A & \hat{Y}_B \cdot \hat{X}_A & \hat{Z}_B \cdot \hat{X}_A \\
\hat{X}_B \cdot \hat{Y}_A & \hat{Y}_B \cdot \hat{Y}_A & \hat{Z}_B \cdot \hat{Y}_A \\
\hat{X}_B \cdot \hat{Z}_A & \hat{Y}_B \cdot \hat{Z}_A & \hat{Z}_B \cdot \hat{Z}_A
\end{bmatrix} ,    (2.1)
often referred to as a rotation matrix. By close inspection of rows and columns we see that
{}^{A}_{B}R =
\begin{bmatrix} {}^{A}\hat{X}_B & {}^{A}\hat{Y}_B & {}^{A}\hat{Z}_B \end{bmatrix} =
\begin{bmatrix} {}^{B}\hat{X}_A^{T} \\ {}^{B}\hat{Y}_A^{T} \\ {}^{B}\hat{Z}_A^{T} \end{bmatrix} .    (2.2)
Clearly, the orientation of frame {A} relative to {B} is the transpose of the DCM above.
Hence, the inverse of a DCM is equal to its transpose,
{}^{A}_{B}R^{T} = {}^{B}_{A}R = {}^{A}_{B}R^{-1} ,    (2.3)
and therefore the matrix is orthogonal. Since 1 = det(I) = det(R^T R) = det(R^T) det(R) =
det(R)^2, an orthonormal matrix has determinant equal to ±1, and thus the mapping defined
by this matrix preserves vector length. Additionally, vector orientation is preserved or reversed depending on whether the determinant is positive or negative. A proper rotation
matrix has determinant equal to +1 and belongs to the special orthogonal group, SO(3). An
improper rotation matrix has determinant equal to −1 and does not belong to the rotational
group.
Consecutive multiplication of rotation matrices often arises in many fields, such as
manipulator kinematics, where multiple reference frames exist. Multiplication of rotation
matrices is associative,

{}^{A}_{B}R \, ({}^{B}_{C}R \, {}^{C}_{D}R) = ({}^{A}_{B}R \, {}^{B}_{C}R) \, {}^{C}_{D}R = {}^{A}_{B}R \, {}^{B}_{C}R \, {}^{C}_{D}R ,    (2.4)

but not commutative,

{}^{A}_{B}R \, {}^{B}_{C}R \neq {}^{B}_{C}R \, {}^{A}_{B}R .    (2.5)
It is likely that the origins of two reference frames are not coincident. The displacement between origins is known as translation. In space, a three element vector is used to
describe the position of a reference frame’s origin relative to another frame. For example,
the translation of frame {B} relative to frame {A} is equivalent to the position of frame
{B}’s origin within frame {A}, written as
{}^{A}t_B = {}^{A}P_{B\,org} .    (2.6)
The homogeneous transformation matrix,

H = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} =
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & t_x \\
r_{21} & r_{22} & r_{23} & t_y \\
r_{31} & r_{32} & r_{33} & t_z \\
0 & 0 & 0 & 1
\end{bmatrix} ,    (2.7)
is a convenient way to combine rotation and translation into a single matrix. An arbitrary
point,
{}^{A}P_i = \begin{bmatrix} {}^{A}X_i & {}^{A}Y_i & {}^{A}Z_i \end{bmatrix}^{T} ,    (2.8)
of known location in one reference frame, {A}, may be expressed in another frame, {B}, by
a rotation and translation,
{}^{B}P_i = {}^{B}_{A}R \, {}^{A}P_i + {}^{B}t_A .    (2.9)
With the homogeneous transform, a point set of arbitrary size,
{}^{A}P_* = \begin{bmatrix} {}^{A}P_1 & \dots & {}^{A}P_n \end{bmatrix} ,    (2.10)
can be transformed in a single matrix multiplication,

\begin{bmatrix} {}^{B}P_* \\ 1 \end{bmatrix} = {}^{B}_{A}H \begin{bmatrix} {}^{A}P_* \\ 1 \end{bmatrix} .    (2.11)
This operation is more difficult to achieve when the rotation and translation are used separately.
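As a concrete illustration of equations 2.7-2.11, the following is a minimal NumPy sketch (not part of the thesis implementation; the function names are illustrative only) that assembles a homogeneous transform and applies it to a whole point set in one multiplication.

```python
import numpy as np

def homogeneous(R, t):
    """Assemble the 4x4 homogeneous transform H = [R t; 0 1] of equation 2.7."""
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = t
    return H

def transform_points(H, P):
    """Apply H to a 3xN point set (equation 2.11): append a row of ones, multiply, drop it."""
    P_h = np.vstack([P, np.ones((1, P.shape[1]))])   # homogeneous coordinates
    return (H @ P_h)[:3, :]

# Example: rotate a point set 90 degrees about Z and translate along X.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
t = np.array([1.0, 0.0, 0.0])
P = np.array([[0.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])        # two points stored as columns
print(transform_points(homogeneous(Rz, t), P))
```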
2.2
Rotation Parameterization
According to Euler’s rotation theorem the general displacement of a rigid body (or coordinate
frame) with one point fixed is described by a rotation about some axis. Furthermore, such a
rotation may be uniquely described by a minimum of three parameters. However, for various
reasons, there are several ways to represent such a rotation. Many of these representations
use more than the necessary minimum of three parameters, although such representations
still only have three degrees of freedom.
Cayley’s transform for orthogonal matrices states that any orthogonal matrix, O of
any size, which does not have −1 as an eigenvalue can be represented by a skew-symmetric
matrix, S = −S T , via
O = (I - S)^{-1} (I + S) .    (2.12)
Similarly, the skew-symmetric matrix may be computed as
S = (O - I)(O + I)^{-1} ,    (2.13)
which illustrates that the nine elements of a rotation matrix are not independent and may
be described by a minimum of 3 parameters,

S_3 = \begin{bmatrix} 0 & -s_z & s_y \\ s_z & 0 & -s_x \\ -s_y & s_x & 0 \end{bmatrix} .    (2.14)
However, the mapping from DCM to skew symmetric matrix is unbounded for certain orthogonal matrices.
Furthermore, six constraints may be placed on the nine elements of the DCM leaving
only three degrees of freedom. By letting X̂, Ŷ , and Ẑ equal the rows or columns of the
rotation matrix, the constraints may be written as
\| \hat{X} \| = \| \hat{Y} \| = \| \hat{Z} \| = 1 ,    (2.15)

\hat{X} \cdot \hat{Y} = \hat{X} \cdot \hat{Z} = \hat{Y} \cdot \hat{Z} = 0 .    (2.16)
Using the direction cosines (or DCM) requires a total of nine parameters. Commonly used parameter triples include Euler angles, Cardan/Bryant angles, and Rodrigues
parameters. Four parameter representations include Euler angle-axis, Euler parameters, and
Cayley-Klein parameters. Only the four parameter representations behave well for arbitrary
rotations because a nonsingular mapping to a rotation matrix requires a set of at least four
parameters.
2.2.1
Euler Angles
Euler angles are the most commonly used parameter triplet and consist of three consecutive
rotations about orthogonal axes, i.e.

R_x(\gamma) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & c(\gamma) & -s(\gamma) \\ 0 & s(\gamma) & c(\gamma) \end{bmatrix} , \quad
R_y(\beta) = \begin{bmatrix} c(\beta) & 0 & s(\beta) \\ 0 & 1 & 0 \\ -s(\beta) & 0 & c(\beta) \end{bmatrix} , \quad
R_z(\alpha) = \begin{bmatrix} c(\alpha) & -s(\alpha) & 0 \\ s(\alpha) & c(\alpha) & 0 \\ 0 & 0 & 1 \end{bmatrix} ,    (2.17)
where s(θ) = sin(θ) and c(θ) = cos(θ). There are 4!/(4-3)! = 24 possible rotation combinations,
so it is very important to specify the convention used. Many problems arise from this
ambiguity. For example, the XYZ and ZYX Euler angle conventions are
R(\theta_{xyz}) = R_x(\gamma) R_y(\beta) R_z(\alpha)    (2.18)

and

R(\theta_{zyx}) = R_z(\alpha) R_y(\beta) R_x(\gamma)    (2.19)

respectively.
In many domains Euler angles provide a very intuitive parameterization of rotation,
such as manipulator kinematics, which is a likely cause of such popularity. However, they do
suffer from singularities in their mapping from DCM to rotation angles and are not a good
parameterization choice for arbitrary rotations. For example, the two conventions shown
above both have singularities at |β| = π/2. The situation when singularities are encountered
is referred to as “gimbal lock”; effectively, a degree of freedom has been lost which the
parameterization cannot handle.
2.2.2
Angle-Axis
Decomposition of a rotation matrix, R, will yield eigenvalues of {1, e±iθ } or equivalently
{1, cos(θ) ± i sin(θ)}. The product of the eigenvalues is equal to the determinant and thus
equal to +1. The angle θ which appears in the eigenvalue expression corresponds to the
magnitude of rotation in Euler’s angle-axis representation. The eigenvector corresponding
to the +1 eigenvalue is the accompanying Euler axis. A rotation matrix parameterized by
the Euler angle-axis may be written as

R = \begin{bmatrix}
k_x^2 v_a + c_a & k_x k_y v_a - k_z s_a & k_x k_z v_a + k_y s_a \\
k_x k_y v_a + k_z s_a & k_y^2 v_a + c_a & k_y k_z v_a - k_x s_a \\
k_x k_z v_a - k_y s_a & k_y k_z v_a + k_x s_a & k_z^2 v_a + c_a
\end{bmatrix}    (2.20)
where ca = cos(θ), sa = sin(θ), and va = 1 − cos(θ). The expression may be written in a
more compact form that uses a skew symmetric matrix similar to those found in Cayley’s
transform for orthogonal matrices, specifically
R = I + \sin(\theta) S + (1 - \cos(\theta)) S^2 ,    (2.21)
where

S = \begin{bmatrix} 0 & -\hat{k}_z & \hat{k}_y \\ \hat{k}_z & 0 & -\hat{k}_x \\ -\hat{k}_y & \hat{k}_x & 0 \end{bmatrix} .    (2.22)
This leads to a calculation of the Euler angle-axis that does not require eigenvalue
decomposition of the rotation matrix. Some simple algebra yields the following result. First,
the angle may be calculated from the trace of the matrix.
\theta = \cos^{-1}\!\left( \frac{\mathrm{tr}(R) - 1}{2} \right) , \quad \theta \in [0, \pi]    (2.23)
When θ = 0 any unit vector is a valid axis because there is no rotation. If θ ∈ (0, π), the
axis may be extracted directly from
S = \frac{R - R^T}{2 \sin(\theta)}    (2.24)
so that

k = \frac{1}{2 \sin(\theta)} \begin{bmatrix} r_{32} - r_{23} \\ r_{13} - r_{31} \\ r_{21} - r_{12} \end{bmatrix} .    (2.25)
However, when θ = π, the axis must be constructed differently since R - R^T = 0 and

R = I_3 + 2 S^2 = \begin{bmatrix}
1 - 2(k_y^2 + k_z^2) & 2 k_x k_y & 2 k_x k_z \\
2 k_x k_y & 1 - 2(k_x^2 + k_z^2) & 2 k_y k_z \\
2 k_x k_z & 2 k_y k_z & 1 - 2(k_x^2 + k_y^2)
\end{bmatrix} .    (2.26)
The most stable way of calculating the axis is to extract the maximum component from the
diagonal entries of the rotation matrix. From the diagonal terms we obtain
k_x^2 = \frac{1 + r_{11} - r_{22} - r_{33}}{4} , \quad
k_y^2 = \frac{1 - r_{11} + r_{22} - r_{33}}{4} , \quad
k_z^2 = \frac{1 - r_{11} - r_{22} + r_{33}}{4} ,    (2.27)

and from the off-diagonal terms we obtain

k_x k_y = \frac{r_{12}}{2} = \frac{r_{21}}{2} , \quad
k_x k_z = \frac{r_{13}}{2} = \frac{r_{31}}{2} , \quad
k_y k_z = \frac{r_{23}}{2} = \frac{r_{32}}{2} .    (2.28)
To calculate the axis, simply choose the equation from 2.27 that yields the largest
value and then calculate the remaining two parameters from equation 2.28.
It is worth noting that the angle-axis representation is unique only up to a simultaneous sign
change, because the opposite angle and axis represent the same rotation, i.e. R(k̂, θ) = R(−k̂, −θ). Therefore, the inverse
rotation may be found by negating either the axis or angle. The axis is most often negated
so that the angle remains within the domain specified earlier, [0, π].
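The following NumPy sketch (illustrative only, not the thesis code) implements the angle-axis extraction of equations 2.23-2.28, including the θ = π branch that selects the largest diagonal term.

```python
import numpy as np

def angle_axis_from_R(R, eps=1e-8):
    """Recover (theta, k) from a rotation matrix per equations 2.23-2.28."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < eps:                       # no rotation: any unit axis is valid
        return 0.0, np.array([1.0, 0.0, 0.0])
    if np.pi - theta > eps:               # generic case, equation 2.25
        k = np.array([R[2, 1] - R[1, 2],
                      R[0, 2] - R[2, 0],
                      R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
        return theta, k
    # theta == pi: use the largest diagonal term (equations 2.27-2.28)
    d = np.array([1 + R[0, 0] - R[1, 1] - R[2, 2],
                  1 - R[0, 0] + R[1, 1] - R[2, 2],
                  1 - R[0, 0] - R[1, 1] + R[2, 2]]) / 4.0
    i = int(np.argmax(d))
    j, l = [(1, 2), (0, 2), (0, 1)][i]
    k = np.empty(3)
    k[i] = np.sqrt(d[i])
    k[j] = R[i, j] / (2.0 * k[i])         # off-diagonal terms, equation 2.28
    k[l] = R[i, l] / (2.0 * k[i])
    return theta, k / np.linalg.norm(k)

# Quick check: 120 degrees about the (1,1,1)/sqrt(3) axis, built via equation 2.21.
theta = 2 * np.pi / 3
k = np.ones(3) / np.sqrt(3)
K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
print(angle_axis_from_R(R))   # ~ (2.0944, [0.577, 0.577, 0.577])
```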
2.2.3
Quaternions
Quaternions were developed by Hamilton in 1843 and are a non-commutative extension
of complex numbers into four-dimensions. Although they have been superseded in most
applications by ordinary vector algebra, they still find use in both theoretical and applied
mathematics, in particular, calculations involving three-dimensional spatial rotations. The
set of quaternions is defined as
Q = \{ q_0 + q_x i + q_y j + q_z k \mid q_0, q_x, q_y, q_z \in \mathbb{R} \} .    (2.29)
The quaternion may be viewed as a linear combination of a scalar and vector, q = q0 + ~q.
As an ordinary vector, the quaternion will be written as
q = \begin{bmatrix} q_0 & \vec{q} \end{bmatrix}^{T} = \begin{bmatrix} q_0 & q_x & q_y & q_z \end{bmatrix}^{T} .    (2.30)
A unit quaternion satisfies the normality constraint
\|q\| = q_0^2 + q_x^2 + q_y^2 + q_z^2 = 1 ,    (2.31)
and like any other vector, a quaternion can be “normalized” to unit length by simply dividing
by the norm, \hat{q} = q / \|q\|. The quaternion conjugate is essentially the same as the regular complex
conjugate,

q^{*} = q_0 - \vec{q} = \begin{bmatrix} q_0 & -q_x & -q_y & -q_z \end{bmatrix}^{T} ,    (2.32)
and the quaternion inverse is the conjugate divided by the norm, i.e.
q^{-1} = \frac{q^{*}}{\|q\|} .    (2.33)
Euler-Rodrigues parameters are a unit quaternion. Rodrigues proposed the idea
around the same time Hamilton invented quaternion theory. The quaternion represents
a rotation in terms of the Euler angle-axis as
q_0 = \cos\!\left(\frac{\theta}{2}\right) , \quad \vec{q} = \sin\!\left(\frac{\theta}{2}\right) \hat{k} .    (2.34)
Therefore, the identity quaternion of rotation has a +1 scalar and a zero vector,
q_I = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}^{T} .    (2.35)
The singularity within the angle-axis representation occurs at zero rotation because the axis
is undefined. As shown above, quaternions do not suffer from this problem; hence, they have
a nonsingular mapping. A rotation matrix parameterized by the quaternion may be written
as

R = \begin{bmatrix}
q_0^2 + q_x^2 - q_y^2 - q_z^2 & 2(q_x q_y - q_0 q_z) & 2(q_x q_z + q_0 q_y) \\
2(q_x q_y + q_0 q_z) & q_0^2 - q_x^2 + q_y^2 - q_z^2 & 2(q_y q_z - q_0 q_x) \\
2(q_x q_z - q_0 q_y) & 2(q_y q_z + q_0 q_x) & q_0^2 - q_x^2 - q_y^2 + q_z^2
\end{bmatrix} .    (2.36)
For unit quaternions the expression can be simplified along the diagonal, so that

R = \begin{bmatrix}
1 - 2(q_y^2 + q_z^2) & 2(q_x q_y - q_0 q_z) & 2(q_x q_z + q_0 q_y) \\
2(q_x q_y + q_0 q_z) & 1 - 2(q_x^2 + q_z^2) & 2(q_y q_z - q_0 q_x) \\
2(q_x q_z - q_0 q_y) & 2(q_y q_z + q_0 q_x) & 1 - 2(q_x^2 + q_y^2)
\end{bmatrix} .    (2.37)
This leads to a calculation of the rotation quaternion from the rotation matrix directly
instead of using the angle-axis. From the diagonal terms we obtain
q_0^2 = \frac{1 + r_{11} + r_{22} + r_{33}}{4} , \quad
q_x^2 = \frac{1 + r_{11} - r_{22} - r_{33}}{4} , \quad
q_y^2 = \frac{1 - r_{11} + r_{22} - r_{33}}{4} , \quad
q_z^2 = \frac{1 - r_{11} - r_{22} + r_{33}}{4} ,    (2.38)

and from the off-diagonal terms we obtain

q_0 q_x = \frac{r_{32} - r_{23}}{4} , \quad
q_0 q_y = \frac{r_{13} - r_{31}}{4} , \quad
q_0 q_z = \frac{r_{21} - r_{12}}{4} ,

q_x q_y = \frac{r_{12} + r_{21}}{4} , \quad
q_x q_z = \frac{r_{13} + r_{31}}{4} , \quad
q_y q_z = \frac{r_{23} + r_{32}}{4} .    (2.39)
The most stable way of calculating the quaternion is to extract the maximum component
from the diagonal terms of the rotation matrix. Choose the equation from 2.38 that yields
the largest value and then calculate the remaining three parameters from equation 2.39.
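A minimal NumPy sketch of this extraction (equations 2.38-2.39) is shown below; it is an illustration rather than the thesis implementation, and it assumes the scalar-first ordering of equation 2.30.

```python
import numpy as np

def quat_from_R(R):
    """Extract q = [q0, qx, qy, qz] from R using the most stable diagonal term."""
    d = np.array([1 + R[0, 0] + R[1, 1] + R[2, 2],     # 4*q0^2
                  1 + R[0, 0] - R[1, 1] - R[2, 2],     # 4*qx^2
                  1 - R[0, 0] + R[1, 1] - R[2, 2],     # 4*qy^2
                  1 - R[0, 0] - R[1, 1] + R[2, 2]])    # 4*qz^2
    i = int(np.argmax(d))
    q = np.empty(4)
    q[i] = 0.5 * np.sqrt(d[i])
    s = 0.25 / q[i]                                    # remaining terms from eq. 2.39
    if i == 0:
        q[1] = s * (R[2, 1] - R[1, 2])
        q[2] = s * (R[0, 2] - R[2, 0])
        q[3] = s * (R[1, 0] - R[0, 1])
    elif i == 1:
        q[0] = s * (R[2, 1] - R[1, 2])
        q[2] = s * (R[0, 1] + R[1, 0])
        q[3] = s * (R[0, 2] + R[2, 0])
    elif i == 2:
        q[0] = s * (R[0, 2] - R[2, 0])
        q[1] = s * (R[0, 1] + R[1, 0])
        q[3] = s * (R[1, 2] + R[2, 1])
    else:
        q[0] = s * (R[1, 0] - R[0, 1])
        q[1] = s * (R[0, 2] + R[2, 0])
        q[2] = s * (R[1, 2] + R[2, 1])
    return q / np.linalg.norm(q)

# A 90 degree rotation about Z should give roughly [0.707, 0, 0, 0.707].
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(quat_from_R(Rz90))
```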
Like the angle-axis, the rotation quaternion is unique only up to sign, because
the negated quaternion represents the same rotation, R(q) = R(−q). In other words, two
opposite poles on the hypersphere represent the same rotation.
The equation i^2 = j^2 = k^2 = ijk = -1 is the fundamental formula for multiplication
of complex numbers in four dimensions, from which a set of quaternion multiplicative identities
can be derived:

ij = k ,  ji = -k
jk = i ,  kj = -i    (2.40)
ki = j ,  ik = -j .
From the derivation, we can see that quaternion multiplication, ⊗, is associative (and distributive) but clearly not commutative.
A 3-dimensional vector, such as position, can be represented as a quaternion with zero
scalar part,

v = \begin{bmatrix} 0 \\ {}^{B}P_i \end{bmatrix} , \quad
v' = \begin{bmatrix} 0 \\ {}^{A}P_i \end{bmatrix} .    (2.41)
The vector can be rotated directly with quaternion multiplication without first converting
back into a rotation matrix. Let R(q) = {}^{A}_{B}R, then

v' = \hat{q} \otimes v \otimes \hat{q}^{-1} .    (2.42)
The same operation can be performed with ordinary matrix-vector multiplication of right-side and
left-side skew symmetric matrices that obey the multiplicative identities above,

v' = Q_L(q) \, Q_R(q^{-1}) \, v ,    (2.43)

where

Q_L(q) = \begin{bmatrix}
q_0 & -q_x & -q_y & -q_z \\
q_x & q_0 & -q_z & q_y \\
q_y & q_z & q_0 & -q_x \\
q_z & -q_y & q_x & q_0
\end{bmatrix} , \quad
Q_R(q) = \begin{bmatrix}
q_0 & -q_x & -q_y & -q_z \\
q_x & q_0 & q_z & -q_y \\
q_y & -q_z & q_0 & q_x \\
q_z & q_y & -q_x & q_0
\end{bmatrix} .    (2.44)
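The sketch below (illustrative Python/NumPy, not the thesis code) builds Q_L and Q_R as in equation 2.44 and uses them to rotate a vector per equations 2.41-2.43; for a unit quaternion the inverse equals the conjugate.

```python
import numpy as np

def QL(q):
    """Left-multiplication matrix of equation 2.44: p (x) q = QL(p) q."""
    q0, qx, qy, qz = q
    return np.array([[q0, -qx, -qy, -qz],
                     [qx,  q0, -qz,  qy],
                     [qy,  qz,  q0, -qx],
                     [qz, -qy,  qx,  q0]])

def QR(q):
    """Right-multiplication matrix of equation 2.44: p (x) q = QR(q) p."""
    q0, qx, qy, qz = q
    return np.array([[q0, -qx, -qy, -qz],
                     [qx,  q0,  qz, -qy],
                     [qy, -qz,  q0,  qx],
                     [qz,  qy, -qx,  q0]])

def qconj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def rotate(q, p3):
    """Rotate a 3-vector by a unit quaternion via equation 2.43."""
    v = np.concatenate(([0.0], p3))        # pure quaternion, equation 2.41
    return (QL(q) @ QR(qconj(q)) @ v)[1:]  # q (x) v (x) q^-1, vector part

# Example: a 90 degree rotation about Z maps the x-axis onto the y-axis.
q = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
print(rotate(q, np.array([1.0, 0.0, 0.0])))   # ~ [0, 1, 0]
```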

2.3
Kinematics
Kinematics is a branch of mechanics which studies the motion of objects without regard of
the underlying forces involved. Dynamics accounts for some or all of the forces which impact
the motion. In our case, motion refers to changes in position and orientation in time. To
reiterate, we are only concerned with the motion and not the underlying forces.
2.3.1
Position
The translational motion of a rigid body may be described as the time derivative of the
origin’s location within a reference frame. Simply imagine a particle moving freely in space.
Typically, the motion may be sufficiently represented in discrete-time by the first and second
derivatives, velocity and acceleration respectively.

p_k = p_{k-1} + \Delta t \, \dot{p}_{k-1} + \frac{\Delta t^2}{2} \, \ddot{p}_{k-1}
    = p_{k-1} + \Delta t \, v_{k-1} + \frac{\Delta t^2}{2} \, a_{k-1}    (2.45)

2.3.2
Orientation
The rotational motion of a rigid body may be described by the time derivative of its orientation. Imagine an orthogonal triad spinning about an axis pointing out of its origin. Angular
velocity is the instantaneous rotation about this axis. The length of the axis is proportional
to the magnitude of rotation.
2.3.2.1
Angular Velocity
A simple derivation is needed to clarify the statement made above and develop a mathematical foundation of angular velocity. Over a small time interval the rotation caused by angular
velocity may be represented by a rotation matrix. The new orientation is the product of this
matrix and the old orientation. Depending on the reference frame of the angular velocity
the order of multiplication may be reversed. In the following derivation the angular velocity
of frame {B} expressed in its own coordinate system is considered. A small rotation occurs
so that
{}^{A}_{B}R_{k+1} = {}^{A}_{B}R_k \, R(\Delta\theta, \hat{k}) .    (2.46)
The time derivative of the rotation matrix is found in the limit as the change of time approaches zero.
{}^{A}_{B}\dot{R}_k = \lim_{\Delta t \to 0} \frac{{}^{A}_{B}R_{k+1} - {}^{A}_{B}R_k}{\Delta t}
= \lim_{\Delta t \to 0} \frac{{}^{A}_{B}R_k \, R(\Delta\theta, \hat{k}) - {}^{A}_{B}R_k}{\Delta t}
= {}^{A}_{B}R_k \left( \lim_{\Delta t \to 0} \frac{R(\Delta\theta, \hat{k}) - I_3}{\Delta t} \right)    (2.47)
Using small angle substitutions into equation 2.20 the following result is obtained.

R(\Delta\theta, \hat{k}) = \begin{bmatrix}
1 & -k_z \Delta\theta & k_y \Delta\theta \\
k_z \Delta\theta & 1 & -k_x \Delta\theta \\
-k_y \Delta\theta & k_x \Delta\theta & 1
\end{bmatrix}    (2.48)
So that

{}^{A}_{B}\dot{R}_k = {}^{A}_{B}R_k \begin{bmatrix}
0 & -k_z \dot{\theta} & k_y \dot{\theta} \\
k_z \dot{\theta} & 0 & -k_x \dot{\theta} \\
-k_y \dot{\theta} & k_x \dot{\theta} & 0
\end{bmatrix} .    (2.49)
The three elements of θ̇k̂ are the angular velocity vector contained within the skew symmetric
matrix,

S(\omega) = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix} ,    (2.50)

which can be calculated as

S({}^{B}\omega_B^k) = {}^{A}_{B}R_k^{T} \, {}^{A}_{B}\dot{R}_k .    (2.51)
As stated earlier, the order of multiplication changes depending on the reference frame
of the angular velocity. The derivation is almost the same for the angular velocity of frame
{B} expressed in frame {A}. A summary of the results is provided below.
{}^{A}_{B}R_{k+1} = R(\Delta\theta, \hat{k}) \, {}^{A}_{B}R_k    (2.52)

{}^{A}_{B}\dot{R}_k = S({}^{A}\omega_B^k) \, {}^{A}_{B}R_k    (2.53)

S({}^{A}\omega_B^k) = {}^{A}_{B}\dot{R}_k \, {}^{A}_{B}R_k^{T}    (2.54)

Using the previous results, the discrete-time rotation update can be approximately expressed as

{}^{A}_{B}R_{k+1} = {}^{A}_{B}R_k \left( I_3 + \Delta t \, S({}^{B}\omega_B^k) \right) ,    (2.55)
where ∆t is the time between samples k and k+1.
2.3.2.2
Euler-Angles
Even though the rotation matrix may be parameterized in a number of ways, a relationship
(usually) exists between angular velocity and the time derivative of the parameterization. If
singularities in the parameterization exist, then the relationship is not always defined.
To illustrate this point, consider the relationship between angular velocity and ZYX
Euler angles. Angular velocity in terms of Euler angle time derivatives may be expressed by
a (convention specific) Jacobian,
{}^{A}\omega_B = J(\theta_{xyz}) \, \dot{\theta}_{xyz}
= R_z(\alpha) R_y(\beta) \begin{bmatrix} \dot{\gamma} \\ 0 \\ 0 \end{bmatrix}
+ R_z(\alpha) \begin{bmatrix} 0 \\ \dot{\beta} \\ 0 \end{bmatrix}
+ \begin{bmatrix} 0 \\ 0 \\ \dot{\alpha} \end{bmatrix}
= \begin{bmatrix}
\cos(\beta)\cos(\alpha) & -\sin(\alpha) & 0 \\
\cos(\beta)\sin(\alpha) & \cos(\alpha) & 0 \\
-\sin(\beta) & 0 & 1
\end{bmatrix}
\begin{bmatrix} \dot{\gamma} \\ \dot{\beta} \\ \dot{\alpha} \end{bmatrix} ,    (2.56)
where R(\theta_{xyz}) = {}^{A}_{B}R. The inverse relationship, Euler angle derivatives as a function of
angular velocity, may be found by symbolically inverting the Jacobian above,

\dot{\theta}_{xyz} = J(\theta_{xyz})^{-1} \, {}^{A}\omega_B = J(\theta_{xyz})^{-1} R(\theta_{xyz}) \, {}^{B}\omega_B ,    (2.57)
where

J(\theta_{xyz})^{-1} = \begin{bmatrix}
\sec(\beta)\cos(\alpha) & \sec(\beta)\sin(\alpha) & 0 \\
-\sin(\alpha) & \cos(\alpha) & 0 \\
\tan(\beta)\cos(\alpha) & \tan(\beta)\sin(\alpha) & 1
\end{bmatrix} .    (2.58)
Obviously, the secant and tangent terms are unbounded near |β| = π/2, leading to numerical
instability. A situation like this can have catastrophic effects on a system and should be
avoided if possible.
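A short numerical illustration of this singularity, using the inverse Jacobian of equation 2.58, is sketched below in NumPy (the angle and rate values are made up for demonstration).

```python
import numpy as np

def J_inv(alpha, beta):
    """Inverse Jacobian of equation 2.58 (gamma does not appear)."""
    sec_b, tan_b = 1.0 / np.cos(beta), np.tan(beta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([[sec_b * ca, sec_b * sa, 0.0],
                     [       -sa,         ca, 0.0],
                     [tan_b * ca, tan_b * sa, 1.0]])

# The same angular velocity maps to ever larger Euler-angle rates as beta -> pi/2.
omega = np.array([0.1, 0.0, 0.0])            # rad/s, expressed in frame {A}
for beta in [0.0, 1.0, 1.5, 1.56, 1.5705]:
    rates = J_inv(alpha=0.3, beta=beta) @ omega
    print(f"beta = {beta:7.4f}  ->  |euler rates| = {np.linalg.norm(rates):10.3f}")
```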
2.3.2.3
Quaternion
As stated earlier, quaternions are a singularity-free parameterization of rotation. Therefore,
the relationship between angular velocity and the quaternion time derivative always exists
and numerical instabilities due to inversion should never be encountered in practice. For this
reason, quaternions are a favorite parameterization of rotation and are used in our system.
The derivation of the quaternion time derivative is similar to that of rotation matrices.
Over a small time interval the rotation caused by angular velocity may be represented by a
quaternion. The new orientation is the product of this quaternion and the old quaternion.
Depending on the reference frame of the angular velocity the order of multiplication may be
reversed. In the following derivation, BωB is considered. A small rotation occurs so that
qk+1 = qk ⊗ q(∆θ, k̂) ,
(2.59)
where R(q) = {}^{A}_{B}R. The time derivative of the quaternion is found in the limit as the change
of time approaches zero.

\dot{q}_k = \lim_{\Delta t \to 0} \frac{q_{k+1} - q_k}{\Delta t}
= \lim_{\Delta t \to 0} \frac{q_k \otimes q(\Delta\theta, \hat{k}) - q_k}{\Delta t}
= q_k \otimes \left( \lim_{\Delta t \to 0} \frac{q(\Delta\theta, \hat{k}) - q_I}{\Delta t} \right)    (2.60)
From the rotation matrix derivation we know that ω = θ̇k̂, and using small angle substitution
into equation 2.34 the following result is obtained.

q(\Delta\theta, \hat{k}) = \begin{bmatrix} 1 \\ \frac{\Delta\theta}{2} \hat{k} \end{bmatrix}    (2.61)

\dot{q}_k = q_k \otimes \frac{\dot{\theta}}{2} \begin{bmatrix} 0 \\ \hat{k} \end{bmatrix}
= q_k \otimes \frac{1}{2} \begin{bmatrix} 0 \\ \dot{\theta}\,\hat{k} \end{bmatrix}
= q_k \otimes \frac{1}{2} \begin{bmatrix} 0 \\ {}^{B}\omega_B^k \end{bmatrix}    (2.62)

Let \Omega be the vector quaternion representing angular velocity, i.e. \Omega^k = \begin{bmatrix} 0 \\ \omega^k \end{bmatrix}. Then

\dot{q}_k = \frac{1}{2} \, q_k \otimes {}^{B}\Omega_B^k    (2.63)

{}^{B}\Omega_B^k = 2 \, q_k^{-1} \otimes \dot{q}_k .    (2.64)
As stated earlier, the order of multiplication changes depending on the reference frame
of the angular velocity. The derivation is almost the same for {}^{A}\omega_B. A summary of the results
is provided below.

q_{k+1} = q(\Delta\theta, \hat{k}) \otimes q_k    (2.65)

\dot{q}_k = \frac{1}{2} \, {}^{A}\Omega_B^k \otimes q_k    (2.66)

{}^{A}\Omega_B^k = 2 \, \dot{q}_k \otimes q_k^{-1}    (2.67)
Recall that quaternion multiplication can be replaced by ordinary matrix multiplication of a 4-dimensional skew symmetric matrix. The previous result may be written as
\dot{q}_k = \frac{1}{2} \, Q_R({}^{B}\Omega_B^k) \, q_k ,    (2.68)

where

Q_R(\Omega) = \begin{bmatrix}
0 & -\omega_x & -\omega_y & -\omega_z \\
\omega_x & 0 & \omega_z & -\omega_y \\
\omega_y & -\omega_z & 0 & \omega_x \\
\omega_z & \omega_y & -\omega_x & 0
\end{bmatrix} .    (2.69)
Using the skew symmetric matrix, the approximate discrete time quaternion update can be
written as
q_{k+1} = \left( I_4 + \frac{\Delta t}{2} Q_R({}^{B}\Omega_B^k) \right) q_k
= q_k + \frac{\Delta t}{2} Q_R({}^{B}\Omega_B^k) \, q_k .    (2.70)

However, from the infinite Taylor series a pattern emerges [36]. Assuming constant ω, the
following expression is a better approximation for larger ∆t and/or ω,

q_{k+1} = \left[ \cos\!\left( \frac{\| {}^{B}\Omega_B^k \| \Delta t}{2} \right) I_4
+ \frac{1}{\| {}^{B}\Omega_B^k \|} \sin\!\left( \frac{\| {}^{B}\Omega_B^k \| \Delta t}{2} \right) Q_R({}^{B}\Omega_B^k) \right] q_k .    (2.71)
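The following NumPy sketch (illustrative only, not the thesis implementation) propagates a quaternion with the closed-form update of equation 2.71, falling back to the first-order form of equation 2.70 when the rate is negligible.

```python
import numpy as np

def QR_omega(w):
    """Q_R of the pure angular-velocity quaternion (equation 2.69)."""
    wx, wy, wz = w
    return np.array([[0.0, -wx, -wy, -wz],
                     [ wx, 0.0,  wz, -wy],
                     [ wy, -wz, 0.0,  wx],
                     [ wz,  wy, -wx, 0.0]])

def propagate(q, w, dt):
    """Quaternion update of equation 2.71, assuming constant body rate w over dt."""
    n = np.linalg.norm(w)
    if n < 1e-12:
        A = np.eye(4) + 0.5 * dt * QR_omega(w)         # first-order form, eq. 2.70
    else:
        half = 0.5 * n * dt
        A = np.cos(half) * np.eye(4) + (np.sin(half) / n) * QR_omega(w)
    q_next = A @ q
    return q_next / np.linalg.norm(q_next)             # guard against numerical drift

# Integrating a constant 90 deg/s yaw rate for one second should give a 90 degree
# rotation about Z: q ~ [cos(45 deg), 0, 0, sin(45 deg)].
q = np.array([1.0, 0.0, 0.0, 0.0])
w = np.array([0.0, 0.0, np.deg2rad(90.0)])             # body-frame rate, rad/s
for _ in range(100):
    q = propagate(q, w, dt=0.01)
print(q)
```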
Chapter 3
Sensors & Data Processing
The pose estimation system relies on measurements from only two sensors, the camera and
IMU. This chapter discusses the processing of raw data from both sensors. The final output of
the data processing will be referred to as a sensor “measurement”. The chapter is divided into
two sections, camera and IMU. The information presented here is necessary to understand
material found in subsequent chapters. For example, the discussion of image formation using
the pinhole camera model is the basis for the entire next chapter. Fusion and filtering of the
two sensors is covered in chapter 5.
The data from both sensors is very different. The IMU data consists of six fields,
three angular velocities and three accelerations. An image consists of 1024x768x3 (about
2.36 million) data fields. Consequently, processing a camera image is much more expensive
and time consuming than the IMU.
The previous chapter included a discussion of 3D points expressed in multiple reference
frames. We are primarily concerned with only three reference frames, the model {M}, camera
{C}, and IMU {I}.
3.1
Camera
A rigid body may be parameterized by a minimum of four noncoplanar 3D points often
referred to as fiducials. These points are assumed to have known locations in the model
frame but not necessarily others. They are constantly used as a basis of reference and
therefore must be carefully defined. These points will simply be called “model points”.
Model points expressed in the model reference frame are denoted by
{}^{M}P_i = \begin{bmatrix} {}^{M}X_i & {}^{M}Y_i & {}^{M}Z_i \end{bmatrix}^{T} ,    (3.1)
and it will sometimes be useful to write all model points together into a single matrix as
{}^{M}P_* = \begin{bmatrix} {}^{M}P_1 & \dots & {}^{M}P_n \end{bmatrix} .    (3.2)
The projection model of the camera, covered next, requires points to be expressed in the
camera frame. Model points expressed in the camera reference frame are denoted as
{}^{C}P_i = \begin{bmatrix} {}^{C}X_i & {}^{C}Y_i & {}^{C}Z_i \end{bmatrix}^{T} , \quad
{}^{C}P_* = \begin{bmatrix} {}^{C}P_1 & \dots & {}^{C}P_n \end{bmatrix} .    (3.3)
A set of points expressed in two different reference frames are related by a homogeneous
transformation,

\begin{bmatrix} {}^{C}P_* \\ 1 \end{bmatrix} = {}^{C}_{M}H \begin{bmatrix} {}^{M}P_* \\ 1 \end{bmatrix} .    (3.4)
Details of the homogeneous transformation were covered in the previous chapter. If the
pose of the model with respect to the camera, {}^{C}_{M}H, is known, then the model points may be
transformed into the camera and projected onto the image plane. In AR, this pose is very
important because it is needed to correctly register virtual objects.
3.1.1
Pinhole Camera Model
In the pinhole camera, rays of light travel through an infinitesimally small aperture, and the
intersection of the light rays with the image plane form an image of the object as shown
in figure 3.1. The mapping from three dimensions to two dimensions is called perspective
projection.
The focal length is the distance between the image plane and center of projection
(COP). As the focal length decreases, the viewing angle increases. The perpendicular line
from the COP through the image plane is the optical axis, usually the z-axis. The point on
the image plane where the optical axis passes through the image plane is the principal point.
To avoid a flipped image, we assume the image plane is in front of the COP.
Figure 3.1: Perspective Projection
The expression for normalized projection is

\begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \end{bmatrix}
= \frac{1}{{}^{C}Z_i} \begin{bmatrix} {}^{C}X_i \\ {}^{C}Y_i \end{bmatrix} .    (3.5)
Note that this does not include any of the intrinsic camera parameters. Such points are
referred to as “normalized” image points.
Image coordinates are, of course, found using the camera's intrinsic parameters. The
parameters result from both the camera hardware and the lens but are simply referred to as
camera parameters. If the lens is not fixed, such as a zoom lens, then some of these parameters will change with different lens settings. In most modern cameras the skew (s) is usually
close to zero. If the camera has square pixels the aspect ratio (τ) will equal 1, or f_v = f_u.
Projection to image coordinates may be compactly written in two equations. These
two equations are referred to as the perspective projection equations,
u_i = f_u \tilde{u}_i + u_0 = f_u \frac{{}^{C}X_i}{{}^{C}Z_i} + u_0
    = f_u \frac{r_{1*} \, {}^{M}P_i + t_x}{r_{3*} \, {}^{M}P_i + t_z} + u_0    (3.6)

v_i = f_v \tilde{v}_i + v_0 = f_v \frac{{}^{C}Y_i}{{}^{C}Z_i} + v_0
    = f_v \frac{r_{2*} \, {}^{M}P_i + t_y}{r_{3*} \, {}^{M}P_i + t_z} + v_0 .    (3.7)
By collecting the intrinsic parameters into a matrix,

K = \begin{bmatrix} \tau f & s & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix}
  = \begin{bmatrix} f_u & s & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{bmatrix} ,    (3.8)
image coordinates can also be computed as

\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = K \begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \\ 1 \end{bmatrix} .    (3.9)
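As an illustration of equations 3.5-3.9, the following NumPy sketch projects model points into pixel coordinates; the intrinsic values are made-up placeholders, not the calibrated parameters of the thesis camera.

```python
import numpy as np

def project(K, R, t, P_model):
    """Project 3xN model points into pixel coordinates (equations 3.5-3.9).
    R, t express the model frame in the camera frame (the pose C_M H)."""
    P_cam = R @ P_model + t[:, None]          # model -> camera frame
    uv_norm = P_cam[:2, :] / P_cam[2, :]      # normalized projection, eq. 3.5
    uv_h = K @ np.vstack([uv_norm, np.ones(P_model.shape[1])])   # eq. 3.9
    return uv_h[:2, :]

# Illustrative numbers only: 1024x768 image, square pixels, zero skew,
# focal length around 800 pixels.
K = np.array([[800.0,   0.0, 512.0],
              [  0.0, 800.0, 384.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])                 # model 5 units in front of the camera
P = np.array([[0.0, 0.5, -0.5],
              [0.0, 0.0,  0.5],
              [0.0, 0.0,  0.0]])              # three model points as columns
print(project(K, R, t, P))
```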
The Matlab Camera Calibration Toolbox is used for estimating the camera's intrinsic
parameters, f_u, f_v, u_0, v_0, k_1, k_2, p_1, p_2. Images of a calibration pattern, a checkerboard,
are taken from many different poses and then processed by Matlab. For each image, a
grid is constructed by segmenting the corners within the pattern. The intrinsic and extrinsic
parameters are determined simultaneously in a large nonlinear optimization. The extrinsic
parameters, the pose of the camera with respect to the grid, are important later in the IMU calibration
section at the end of the chapter.
3.1.2
Image Point Correction
The pinhole model is only an approximation of real camera projection. While it is a useful
model that enables simple mathematical formulation of the relationship between object and
image coordinates, when high accuracy is required a more comprehensive camera model must
be used. A normalized image point with second-order radial and tangential distortion terms
added may be expressed as

\begin{bmatrix} \bar{u}_i \\ \bar{v}_i \end{bmatrix} \simeq
\begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \end{bmatrix}
+ (k_1 r_i^2 + k_2 r_i^4) \begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \end{bmatrix}
+ \begin{bmatrix} 2 p_1 \tilde{u}_i \tilde{v}_i + p_2 (r_i^2 + 2 \tilde{u}_i^2) \\
p_1 (r_i^2 + 2 \tilde{v}_i^2) + 2 p_2 \tilde{u}_i \tilde{v}_i \end{bmatrix} ,
\quad r_i^2 = \tilde{u}_i^2 + \tilde{v}_i^2    (3.10)
A raw image from the camera contains distortion, or warping. Typically, the effect of lens
distortion is more noticeable in a wide angle lens. Unfortunately, the inverse mapping from
Figure 3.2: Cone Geometry
distorted points to undistorted points,

\begin{bmatrix} \bar{u}_i \\ \bar{v}_i \end{bmatrix} \rightarrow
\begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \end{bmatrix} ,    (3.11)
is much more complicated and is usually done iteratively. We use the distortion model
described within [15], [16], [38]. An entire image may be undistorted, or to save time, only
a subset of feature points may be undistorted.
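The sketch below applies the distortion model of equation 3.10 and inverts it by simple fixed-point iteration; it is a generic illustration and not necessarily the exact iterative scheme of [15], [16], [38], and the coefficients are invented for the example.

```python
import numpy as np

def distort(uv, k1, k2, p1, p2):
    """Apply the radial/tangential model of equation 3.10 to normalized points (2xN)."""
    u, v = uv
    r2 = u * u + v * v
    radial = k1 * r2 + k2 * r2 * r2
    du = 2 * p1 * u * v + p2 * (r2 + 2 * u * u)
    dv = p1 * (r2 + 2 * v * v) + 2 * p2 * u * v
    return np.vstack([u + u * radial + du,
                      v + v * radial + dv])

def undistort(uv_bar, k1, k2, p1, p2, iters=10):
    """Invert equation 3.10 iteratively (equation 3.11)."""
    uv = uv_bar.copy()
    for _ in range(iters):
        uv = uv_bar - (distort(uv, k1, k2, p1, p2) - uv)   # remove current distortion estimate
    return uv

# Round trip with made-up coefficients on a few normalized points.
k1, k2, p1, p2 = -0.2, 0.05, 1e-3, -1e-3      # illustrative values only
uv = np.array([[0.1, -0.3, 0.25], [0.2, 0.1, -0.15]])
uv_bar = distort(uv, k1, k2, p1, p2)
print(np.abs(undistort(uv_bar, k1, k2, p1, p2) - uv).max())   # ~1e-6 or smaller
```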
3.1.3
Traffic Cones as Fiducials
The goal of many computer vision applications is automatic detection of features, often referred to as segmentation. Segmentation always relies on image processing techniques in
some of the initial steps. Robust segmentation of complex features in unstructured environments is a very difficult task. As a result people try to use features that are very easy to
segment. Corners, dots, and lines are a few examples.
In augmented reality many systems, such as ARToolKit, have used planar patterns that
are similar to a calibration grid, except that these patterns must be more distinctive. Multiple patterns
Figure 3.3: Model Point Correction
might be used in the same scene and viewed from many different orientations. However,
distance and viewing angle of the camera are highly restricted.
For those reasons, in our system we have decided to use 3-dimensional fiducials, orange
traffic cones. The cones are easy to segment because of their unique shape and color. They
also allow the system to be scalable because any size cone may be used. The geometry of
the cones is very simple and shown in figure 3.2. Each cone has three points of interest along
the major axis, a peak, middle, and base. Thus, a set of model points will be defined as
{}^{M}P_* = \begin{bmatrix}
{}^{M}P^{1}_{peak} & {}^{M}P^{1}_{mid} & {}^{M}P^{1}_{base} & \dots &
{}^{M}P^{n}_{peak} & {}^{M}P^{n}_{mid} & {}^{M}P^{n}_{base}
\end{bmatrix} .    (3.12)
We use three traffic cones in an equilateral triangle configuration where the three peak,
middle, and base points define three parallel planes. Therefore, the major axes of the cones
are parallel and perpendicular to these planes.
3.1.4
Model Point Correction
In reality, the major axis of a cone is not visible. Depending on location, the camera views
a specific side of the cone’s outer surface. The location of the camera with respect to the
cone defines a line of sight. Imagine three vectors originating at a single point which end
at a cone's peak, middle, and base points along the outer surface. The camera does not see
the major axis, which is why the model point correction is needed. The correct way of thinking
about the problem is that the major axis (and peak, middle, and base points) is projected
onto the surface. Given an angle about the major axis, the same three points on the cone’s
surface may be computed. The short routine below and figure 3.3 explain how to correct the
model points from the major axis to the surface given a predicted location of the camera.
Assume that we are only given the location of the COP, peak and base points for each
cone, and the radius to height ratio, in our case τ =1/7. The radius anywhere along the base
is found by r = τ h. The height may be calculated by the peak and base points via
h = \left\| {}^{M}P^{i}_{base} - {}^{M}P^{i}_{peak} \right\| .    (3.13)
As stated earlier, the cones all sit on the same plane. The nearest point of the COP on the
base plane, COP', may be computed in a few easy steps, and any cone can be used for the
calculation. The peak and base points also define a direction vector that is perpendicular to
the base plane,

\vec{h} = \frac{{}^{M}P^{i}_{base} - {}^{M}P^{i}_{peak}}{\left\| {}^{M}P^{i}_{base} - {}^{M}P^{i}_{peak} \right\|} .    (3.14)
A vector from the COP to the base point defines a line of sight,
l_i = {}^{M}P^{i}_{base} - COP .    (3.15)
COP' may be calculated by projecting l_i onto \vec{h}, i.e.

COP' = COP + ( l_i \cdot \vec{h} ) \, \vec{h} .    (3.16)
Now, a base direction toward COP' may be computed for each cone via

\vec{r}_i = \frac{COP' - {}^{M}P^{i}_{base}}{\left\| COP' - {}^{M}P^{i}_{base} \right\|} .    (3.17)
Figure 3.4: Cone Segmentation
Finally, position adjusted model points may be calculated as
{}^{M}\tilde{P}^{i}_{base} = {}^{M}P^{i}_{base} + r \, \vec{r}_i , \quad
{}^{M}\tilde{P}^{i}_{peak} = {}^{M}P^{i}_{peak} , \quad
{}^{M}\tilde{P}^{i}_{mid} = \frac{{}^{M}\tilde{P}^{i}_{peak} + {}^{M}\tilde{P}^{i}_{base}}{2} .    (3.18)
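A compact NumPy sketch of this correction routine (equations 3.13-3.18) follows; the cone layout and camera location are invented for the example, and the function is an illustration rather than the thesis implementation.

```python
import numpy as np

def correct_model_points(peaks, bases, cop, tau=1.0 / 7.0):
    """Shift each cone's base/mid point from the major axis to the visible surface
    (equations 3.13-3.18). peaks, bases: 3xN per-cone points; cop is the predicted
    camera center of projection in the model frame; tau is the radius-to-height
    ratio assumed in the thesis (1/7)."""
    h_vec = bases[:, 0] - peaks[:, 0]
    h_hat = h_vec / np.linalg.norm(h_vec)              # unit axis direction, eq. 3.14
    heights = np.linalg.norm(bases - peaks, axis=0)    # eq. 3.13
    radii = tau * heights                              # r = tau * h

    # Project the COP onto the base plane using any cone (eq. 3.15-3.16).
    l0 = bases[:, 0] - cop
    cop_p = cop + np.dot(l0, h_hat) * h_hat

    # Direction from each base point toward the projected COP (eq. 3.17).
    r_dir = cop_p[:, None] - bases
    r_dir = r_dir / np.linalg.norm(r_dir, axis=0)

    bases_c = bases + radii * r_dir                    # eq. 3.18
    mids_c = 0.5 * (peaks + bases_c)
    return peaks, mids_c, bases_c

# Three 1 m cones sitting on the z = 0 plane (peaks up), camera off to one side.
peaks = np.array([[0.0, 2.0, 1.0],
                  [0.0, 0.0, 1.7],
                  [1.0, 1.0, 1.0]])
bases = peaks.copy()
bases[2, :] = 0.0
cop = np.array([5.0, -3.0, 1.5])
print(correct_model_points(peaks, bases, cop)[2])      # corrected base points
```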
3.1.5
Cone Segmentation
The goal of the cone segmentation is to automatically identify the 2D coordinates of the peak,
middle and base points of the cones within a single image. The segmentation procedure is
the most expensive part of the entire pose estimation system but still may be performed at
frame rate, 30 Hz, with a dual-core processor.
The procedure may be broken down into five distinct steps (figure 3.4): color classification, connected components, blob analysis, template matching, and image point correction.
3.1.5.1
Color Classification
The segmentation procedure relies heavily on the cone’s unique orange color which is not
usually found in most environments. Lighting conditions will however affect the color that
appears in an image. Robust color classification can tolerate normal light fluctuations. The
input to the color classification block is an RGB image and the output is a binary image of
cone pixels (1) and non-cone pixels (0). In other words, the output image should contain
only silhouettes of the cones, as shown in figure 3.5.
Each pixel in an 8-bit RGB image contains a red, green, and blue component that
range from 0 to 255. Therefore, a total of 16.8 million RGB combinations exist. Imagine a
cube with one of its corners lying at the origin and the three associated sides aligned with
the axes, figure 3.6. Traveling along the axes will yield red, green, or blue in increasing
brightness. Moving off of the axes yields a mix of the colors. Black exists at {0,0,0} and
white at {255,255,255}. Traveling along the line from black to white results in gray scale
because no single color dominates. This is consistent with an additive color model.
28
Figure 3.5: Cone Blob
Figure 3.6: RGB Color Space
The hue, saturation, and intensity (HSI) space describes color in a more intuitive
fashion. For example, most color palettes use HSI space. Hue describes the blend of color and
is usually defined from 0 to 2π. Saturation defines the color purity or shade of gray between
black and white and is usually defined between 0-100%. Intensity defines the brightness of
the color and is usually defined between 0 - 100%. The HSI space looks like a cone instead
of a cube as seen in figure 3.7. A nonlinear mapping exists between the two color spaces.
For example, an RGB image may be converted to HSI by

h = \begin{cases} 0 , & \max = \min \\ \frac{\pi}{3}\,\frac{g - b}{\max - \min} , & \max = r \\ \frac{\pi}{3}\,\frac{b - r}{\max - \min} + \frac{2\pi}{3} , & \max = g \\ \frac{\pi}{3}\,\frac{r - g}{\max - \min} + \frac{4\pi}{3} , & \max = b \end{cases}    (3.19)

s = \begin{cases} 0 , & \max = 0 \\ \frac{\max - \min}{\max} , & \text{otherwise} \end{cases}    (3.20)

v = \max .    (3.21)
To define the color of an object a large sample of pixels may be collected in various
lighting conditions. Several different methods of analysis may be used including a multivariate Gaussian approximation, nonlinear PCA, and histogram. Ideally, a small region will
encapsulate most of the pixels. However, all materials have different light dispersion properties and may be more or less difficult to classify. If a color region is to be defined for
classification, HSI space works much better than RGB.
An alternative to the sampling approach is to allow the region to be manipulated
manually in real-time. The user simply adjusts the six parameters within

Pixel_{ij} = \begin{cases} 1 , & (H_L < H_{ij} < H_H) \cap (S_L < S_{ij} < S_H) \cap (I_L < I_{ij} < I_H) \\ 0 , & \text{otherwise} \end{cases}    (3.22)
and sees the result immediately. Under typical lighting conditions we have found that this
calibration step usually does not require any attention because the default values always
work very well.
Figure 3.7: HSI Color Space

To speed up the classification process in the real-time system a huge RGB classification
table is built. All of the 16.8 million RGB combinations are assigned a one or zero as cone and
non-cone pixels. In this way, pixels within an RGB image may be classified directly instead
of constantly converting to HSI and then classifying. Recall that the mapping between color
spaces is nonlinear.
3.1.5.2 Connected Components
Two-dimensional connectivity defines which pixels are connected to other pixels. A set of
pixels in a binary image that form a connected group are called a connected component,
region, or blob. “8-connected” pixels are connected if their edges or corners touch, figure
3.8. If two adjoining pixels are both ones, they are considered to be part of the same object,
whether connected along the horizontal, vertical, or diagonal direction.
There are many properties that can be helpful to describe a region. We are only
interested in the following properties.
• Area - the actual number of pixels in the region.
Figure 3.8: 8-Connectivity
• Solidity - the proportion of the pixels in the convex hull that are also in the region.
The convex hull is the smallest convex polygon that contains the region. Computed as
region area divided by convex area.
• Centroid - the center of mass of the region.
• Major Axis Length - the length of the major axis of the ellipse that has the same
normalized second central moments as the region.
• Minor Axis Length - the length of the minor axis of the ellipse that has the same
normalized second central moments as the region.
• Orientation - the angle between the x-axis and the major axis of the ellipse.
• Major-Minor Axis Ratio - the proportion of major axis to minor axis length. Computed
as major axis length divided by minor axis length.
• Bounding Box - the smallest rectangle that contains the region.
Forming connected components and region properties can be very expensive if the
image size is large and many artifacts exist. In this context, artifacts are regions that
are too small to possibly be a cone. To save time, it is important to reduce the search
space by identifying a region of interest (ROI) that contains as few artifacts as possible.
The ROI is quickly found in a two-step procedure.
First, the image is down-sampled by pyramid decomposition, figure 3.9. Each level
in the pyramid has a quarter of the number of pixels as the level below it. Down sampling
occurs by simply rejecting even rows and columns. We only down-sample twice. Secondly,
the (binary) down-sampled image is searched for the smallest rectangle that contains all of
the remaining cone pixels (1's). The vertices of the rectangle in the down-sampled image
correspond to locations within the full-size binary image. Typically, the ROI is about half
the area of the full image.

Figure 3.9: Pyramid Decomposition
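The two-step ROI search can be sketched in a few lines of MATLAB. Here bw is the full-size binary image produced by the color classification; the variable names are illustrative only.

% Two-level pyramid decomposition and ROI search on a binary cone image bw
small = bw(1:2:end, 1:2:end);        % level 1: reject even rows and columns
small = small(1:2:end, 1:2:end);     % level 2: 1/16 of the original pixels

[rows, cols] = find(small);          % remaining cone pixels (1's)
if isempty(rows)
    roi = [];                        % no cone pixels anywhere
else
    scale = 4;                       % two levels of 2x down-sampling
    rmin = max(1, (min(rows) - 1) * scale + 1);
    rmax = min(size(bw, 1), max(rows) * scale);
    cmin = max(1, (min(cols) - 1) * scale + 1);
    cmax = min(size(bw, 2), max(cols) * scale);
    roi  = [rmin rmax cmin cmax];    % bounding rectangle in the full image
end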
3.1.5.3 Blob Analysis
Blobs are either accepted or rejected by the properties described above.
• Upper and lower bounds are placed on the area. The range is large enough to allow
significant variation in camera viewing distance of the cones.
• A border region is used to reject objects with centroids located near the edges of the
image. Part of the object might not be visible and therefore its properties cannot be
accurately determined.
• Upper and lower bounds are placed on the solidity. The cones are uniform color and
convex. Under ideal lighting conditions they have a solidity value very close to one.
Shadows and low light environments can cause cone pixels to be misclassified; this
causes the solidity to drop. Typically, solidity helps reject non-cone objects that have
similar color but are not uniform and/or convex.
Figure 3.10: Cone with Ellipse
• Upper and lower bounds are placed on the major-minor axis ratio. The cones used for
testing are about 4 inches tall with 1 inch diameter base. Therefore, we expect this
ratio to be somewhere around 4.
If the blob properties are acceptable, then a final line search is performed. The orientation and centroid define two direction vectors along the major axis. The vectors point in
opposite directions; one is the negative of the other. The peak and the base
are both located somewhere along these vectors, as shown in figure 3.10. The search starts
at the centroid and moves outward along the major axis until three consecutive non-cone pixels are
found. This typically works very well because the centroid and orientation are well-defined
properties of the cone shape.
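A MATLAB sketch of the line search is given below. It assumes the binary ROI bw, the blob centroid cen in pixel coordinates [x y], and the major-axis orientation theta in radians; these names, and the exact stopping bookkeeping, are illustrative rather than the production code.

% Line search along the blob's major axis for the peak and base end points
dir1  = [cos(theta); sin(theta)];          % one direction along the major axis
endpt = zeros(2, 2);
for k = 1:2
    d = dir1 * (-1)^(k+1);                 % search both directions
    p = cen(:);  miss = 0;  last = p;
    while miss < 3                          % stop after 3 consecutive non-cone pixels
        p = p + d;                          % step roughly one pixel along the axis
        r = round(p(2));  c = round(p(1));
        if r < 1 || c < 1 || r > size(bw,1) || c > size(bw,2) || ~bw(r,c)
            miss = miss + 1;
        else
            miss = 0;  last = p;            % remember the last cone pixel seen
        end
    end
    endpt(:,k) = last;                      % estimated peak or base location
end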
3.1.5.4 Template Matching
The line search, above, provides a good estimate of the cone’s height and orientation in
the image. Template matching is performed to refine this estimate. The total number
of correlations is restricted to nine because correlation is a very costly operation. Cost increases
with search area and template size. The nine templates include one that corresponds to
the initially estimated height and orientation, followed by eight neighboring templates. A
neighboring template is one unit of height and/or orientation away from the initial estimate,
figure 3.11. The search region is a rectangle that is equal to or greater in size than the bounding
box of the blob.

Figure 3.11: Cone Templates

The templates are very simple and contain a single white triangle with a black background, a perfect silhouette of the cones. The centroid of the triangle is located exactly at
the image center. At program startup all of the templates are preloaded into memory and
indexed by height and orientation. The template set includes a total of (180/3)*((600-100)/3)
= 10,000 binary images, but many are seldom used. A score is calculated at each pixel by
the cross-correlation c(u, v) = \sum_{x,y} I(x, y)\, T(x - u, y - v), where I is the image and the sum
is taken over x, y within the window containing the template T positioned at (u, v).
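The correlation itself is simple enough to sketch in MATLAB. The snippet scores one binary template T against a search window W and keeps the best offset; in the real-time system this is repeated for the nine candidate templates. Variable names are illustrative.

% Cross-correlation of a binary template T over a binary search window W
[th, tw] = size(T);
[wh, ww] = size(W);
best = -inf;  best_uv = [1 1];
for v = 1:(wh - th + 1)
    for u = 1:(ww - tw + 1)
        patch = W(v:v+th-1, u:u+tw-1);
        c = sum(sum(patch .* T));          % correlation score with T placed at offset (u, v)
        if c > best
            best = c;  best_uv = [u v];
        end
    end
end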
3.1.6 Correspondence
The cones in the image correspond to specific cones in the model. Specifically, the 3D
model points of 3.12 are matched to a set of image points,

^{c}p = \begin{bmatrix} u^{1}_{peak} & u^{1}_{mid} & u^{1}_{base} & \dots & u^{n}_{peak} & u^{n}_{mid} & u^{n}_{base} \\ v^{1}_{peak} & v^{1}_{mid} & v^{1}_{base} & \dots & v^{n}_{peak} & v^{n}_{mid} & v^{n}_{base} \end{bmatrix} .    (3.23)

The correspondence problem is difficult because the cones are all identical in appearance.
Also, there is a chance that non-cone objects are detected and actual cones are not detected
while in the field of view. A predicted pose can be used to estimate where the cones
should appear in the image. Many algorithms exist to solve the correspondence problem
given noisy observations and a prediction. We only have three cones, so a simple greedy
algorithm is used. All possible combinations are enumerated and scored. The combination
with the lowest score is kept if it is below a maximum distance threshold.
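The greedy enumeration can be sketched as follows in MATLAB, assuming three model cones, at least three detections, and a predicted image location for each model cone; the variable names (pred, det, maxDist) are placeholders.

% Greedy correspondence between predicted and detected cone points
% pred    : 2x3 predicted image locations of the three model cones
% det     : 2xm detected cone centroids (m >= 3 assumed here)
% maxDist : rejection threshold in pixels
best = inf;  bestAssign = [];
cands = nchoosek(1:size(det, 2), 3);           % which three detections to use
for c = 1:size(cands, 1)
    P = perms(cands(c, :));                    % every ordering of those three
    for p = 1:size(P, 1)
        d = sqrt(sum((pred - det(:, P(p, :))).^2, 1));
        if all(d < maxDist) && sum(d) < best
            best = sum(d);  bestAssign = P(p, :);   % detection index per model cone
        end
    end
end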
3.1.7 Pose Estimation
Given the camera’s intrinsic parameters, pose may be determined from a set of 2D image
points and corresponding 3D model points. The discussion of the problem is quite lengthy,
and for this reason, it will not be covered until the next chapter.
3.2 Inertial Measurement Unit
The IMU is a small 3-axis inertial sensor that provides high-resolution angular velocity,

z_{gyro} = ^{I}\omega_{imu} = \begin{bmatrix} \omega_x & \omega_y & \omega_z \end{bmatrix}^T ,    (3.24)

and acceleration,

z_{acl} = ^{I}a_{imu} + ^{I}a_{grav} = \begin{bmatrix} a_x & a_y & a_z \end{bmatrix}^T ,    (3.25)

data through a serial interface (RS232). MEMS gyroscopic rate sensors and accelerometers
are mounted on orthogonal axes, figure 3.12.
Because our goal is to estimate the pose of the camera with respect to the model, the
IMU measurements are transformed into the camera coordinate system. The IMU is rigidly
mounted to the camera so that the pose of the IMU with respect to the camera does not
change. Although the translation between the two sensors is actually around ten centimeters,
we set it equal to zero to simplify the math to follow. This means that the origins of the
two coordinate systems are effectively coincident at a single point, P_{C_{org}} \equiv P_{I_{org}}. Chapters 5 and 6
of Craig [8] discuss the relationship of angular velocity and acceleration in different
reference frames.

Figure 3.12: IMU Coordinate System

For two coincident reference frames, the relationship may be simplified to
^{C}\omega_{cam} \equiv ^{C}_{I}R\, ^{I}\omega_{imu} , \qquad ^{C}a_{cam} + ^{C}a_{grav} \equiv ^{C}_{I}R \left( ^{I}a_{imu} + ^{I}a_{grav} \right) .    (3.26)

3.2.1 Static Bias Correction

We assume that the gyros output a measurement that contains a bias and Gaussian zero-mean noise,

^{I}\omega'_{imu} = ^{I}\omega_{imu} + bias + v_{gyro} ,    (3.27)
where vgyro ∼ N (0, σgyro ). Modeling the accelerometer noise is much more complicated and
not covered here. The gyro bias seems to change by a small amount every time the unit is
turned on and might even drift slightly during operation. A good correction scheme would
constantly refine the biases during operation by using a complementary Kalman Filter, [29].
We call the correction “static” because of the way the bias is calculated. Many static
angular velocity measurements are recorded at several different orientations. This could be
done as a calibration step every time the system started. All axes of the gyros should read
zero if the unit is motionless. When these static values are plotted, the histogram has the
normal form with approximately the same non-zero mean at all orientations. Surprisingly,
the bias on each axis can be very different.
The static measurements also provide a measure of uncertainty for the gyros.
After the bias is removed, the variance of the measurement noise remains the same. An
estimate of this uncertainty is important in later chapters.
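A sketch of the static correction in MATLAB: collect a batch of samples while the unit is motionless and average them. The variable names are illustrative.

% Static gyro bias and noise estimation from motionless samples
% w : 3xN matrix of raw angular velocity measurements recorded while the unit is still
bias   = mean(w, 2);                          % per-axis bias; can differ from axis to axis
sigma  = std(w, 0, 2);                        % per-axis measurement noise (used later in the filter)
w_corr = w - bias * ones(1, size(w, 2));      % bias-corrected measurements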
3.2.2 Calibration of Relative Sensor Orientation
The goal of this calibration procedure is to determine the relative orientation of the camera
and IMU by using gravity vectors. For each static pose in the camera calibration, a large
set of accelerometer measurements are recorded and then averaged. Under this motionless
condition, the accelerometers provide a measure of gravity. In the calibration grid coordinate system gravity is defined to point in the downward direction. Therefore, the extrinsic
parameters from camera calibration can be used to rotate gravity from the grid into the
camera’s coordinate system via
^{C}a^{i}_{grav} = ^{C}_{G}R_{i}\, ^{G}a_{grav} .    (3.28)

We are interested in the rotation matrix that aligns these two sets of vectors,

^{C}a^{*}_{grav} = \begin{bmatrix} ^{C}a^{1}_{grav} & \dots & ^{C}a^{n}_{grav} \end{bmatrix} , \qquad ^{I}a^{*}_{grav} = \begin{bmatrix} ^{I}a^{1}_{grav} & \dots & ^{I}a^{n}_{grav} \end{bmatrix} .    (3.29)

With a reasonable initial guess, a nonlinear minimization algorithm may be used by parameterizing the rotation matrix.

^{C}_{I}R \sim \left( ^{C}a_{grav}\, ^{I}a^{T}_{grav} \right) \left( ^{I}a_{grav}\, ^{I}a^{T}_{grav} \right)^{-1}    (3.30)

\min_{\alpha,\beta,\gamma} E = \sum_{i=1}^{n} \left\| ^{C}_{I}R(\alpha,\beta,\gamma)\, ^{I}a^{i}_{grav} - ^{C}a^{i}_{grav} \right\|^{2}    (3.31)

A superior approach uses singular value decomposition of a moment matrix.

M = \sum_{i=1}^{n} ^{C}a^{i}_{grav} \left( ^{I}a^{i}_{grav} \right)^{T}    (3.32)

\{U, S, V\} = svd(M)    (3.33)

^{C}_{I}R = U V^{T}    (3.34)
No initial guess is required and the rotation matrix is not parameterized. More details of
this relative orientation method are provided in the next chapter.
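The SVD approach of equations 3.32-3.34 reduces to a few lines of MATLAB; Ac and Ai below are placeholder names for the stacked gravity vectors in the camera and IMU frames.

% Camera-IMU relative orientation from paired gravity vectors (eqs. 3.32-3.34)
% Ac, Ai : 3xn gravity vectors expressed in the camera and IMU frames
M = zeros(3, 3);
for i = 1:size(Ac, 2)
    M = M + Ac(:, i) * Ai(:, i)';       % moment matrix (eq. 3.32)
end
[U, S, V] = svd(M);                      % eq. 3.33
R_ci = U * V';                           % {C}_{I}R, eq. 3.34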
Chapter 4
Single Camera, Single Frame Pose
The problem of determining pose from a single 2D-3D point set was introduced in the
last chapter. Because the discussion of the problem is so lengthy, it has been given its
own chapter. As a reminder, pose may be determined from a set of 2D image points with
corresponding 3D model points given the camera’s intrinsic parameters. First, basic concepts
and several classic algorithms are presented. Then, several experimental algorithms are
presented as alternatives.
4.1 Types of Algorithms
There is a long history of study in this area and many algorithms have been proposed.
Difficulty arises due to image point noise, nonlinearity of perspective projection, and the
constraint requirements of any 6-DOF pose parameterization. Linear algorithms can be very
fast but are typically not as accurate, [12], [3], [31]. In general, they attempt to relax the
pose constraints or move them outside of the image based minimization. Iterative algorithms
constrain the pose directly and are usually very accurate, [24]. Unfortunately, they require
a reasonably good initial guess for fast and proper convergence to the global minimum.
Qualities of the 2D-3D point set itself may be used to distinguish many algorithms
from each other. In object space, the points may all lie on a plane, such as a calibration grid,
or they may form a non-planar configuration which might be arbitrary or of specific shape.
The number of points in the set is also very important. Some algorithms are designed for
a specific number of points, such as 3, 4, and 5, while others work for an arbitrary number
of points. For an arbitrary number of points, the algorithm complexity is very important
because the number of points might become quite large. Finally, the algorithm will operate
in image or object space as seen in figure 4.1.

Figure 4.1: Error Spaces
4.1.1 Image Space
The majority of all algorithms operate in image space, meaning that the algorithm attempts
to minimize a residual consisting of errors on the image plane, in two dimensions. Each
image point has two error components and the total error, E, for the entire set is written as
a sum of squares.
E = \sum_{i=1}^{n} \left( f_u \frac{ ^{C}X_i }{ ^{C}Z_i } + u_0 - u_i \right)^{2} + \left( f_v \frac{ ^{C}Y_i }{ ^{C}Z_i } + v_0 - v_i \right)^{2}
  = \sum_{i=1}^{n} \left( f_u \frac{ r_{1*}\, ^{M}P_i + t_x }{ r_{3*}\, ^{M}P_i + t_z } + u_0 - u_i \right)^{2} + \left( f_v \frac{ r_{2*}\, ^{M}P_i + t_y }{ r_{3*}\, ^{M}P_i + t_z } + v_0 - v_i \right)^{2}    (4.1)
The error equations are based on the perspective projection equations of chapter three. It
is worth noting that the error is not a true orthogonal projection.
4.1.2 Object Space
Fewer algorithms operate in object space for various reasons. One possible reason may be
that people are simply unfamiliar with the concept because it is not directly obvious from the
standard perspective projection equations. Image points are used to construct a projection
matrix which performs an orthogonal projection of a 3D camera point onto the direction
vector defined by the image point. The idea of applying the projection matrix to a computer
vision algorithm was first introduced by Lu and Hager in the Orthogonal Iteration Algorithm
[25]. The object space error function is defined as

E = \sum_{i=1}^{n} \left\| (I - F_i)\, ^{C}P_i \right\|^{2} = \sum_{i=1}^{n} \left\| (I - F_i) \left( ^{C}_{M}R\, ^{M}P_i + ^{C}t_M \right) \right\|^{2} ,    (4.2)

where F_i is the projection matrix. It is constructed from “normalized” image points (chapter
3) so that

F_i = \frac{ w_i w_i^{T} }{ w_i^{T} w_i } .    (4.3)

Recall that such points are unit focal length with image offsets removed, i.e.

w_i = \begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \\ 1 \end{bmatrix} = \begin{bmatrix} (u_i - u_0)/f_u \\ (v_i - v_0)/f_v \\ 1 \end{bmatrix} .    (4.4)
However, if the effective focal lengths are equal, fu = fv = f , construction of the projection
matrix really only requires that the image offsets be removed. Further, normalized image
points may already be available from the undistorting procedure of chapter 3.
A quick derivation of the projection matrix, F_i, shows that it is a true orthogonal
projection which uses the dot product as shown below.

F_i\, ^{C}P_i = \frac{ w_i w_i^{T} }{ \|w_i\| \|w_i\| }\, ^{C}P_i = \frac{ w_i }{ \|w_i\| } \frac{ w_i^{T}\, ^{C}P_i }{ \|w_i\| } = \hat{w}_i \cos(\theta) \left\| ^{C}P_i \right\|    (4.5)

If w_i and ^{C}P_i are colinear then \hat{w}_i = ^{C}\hat{P}_i and \theta = 0. Clearly we have an identity. If not
colinear, an error exists,

e_i = \left\| ^{C}P_i - F_i\, ^{C}P_i \right\|^{2} = \left\| ^{C}P_i - \hat{w}_i \cos(\theta) \left\| ^{C}P_i \right\| \right\|^{2} .    (4.6)
From the previous result it can be seen that 3D points lying further away from the
camera are weighted more heavily than close points; in fact, the weight is proportional to
distance. This may be considered an undesirable property that has deterred people from
object space algorithms, but it is not much of a problem if all points have nearly the same
depth.
4.2 Classic Algorithms
Several classic algorithms are presented in this section for two reasons: (1) to reinforce the
different algorithm types discussed earlier, and (2) to establish a foundation for the experimental
algorithms section to follow.
4.2.1 Perspective-3-Point - Fischler & Bolles
A solution to the perspective-3-point problem was introduced by Fischler and Bolles in
their classic RANSAC paper, [13]. It is one of the easiest algorithms to understand and
visualize. In figure 4.2, the center of projection, L, and three object points, A, B, and C
form a tetrahedron. The law of cosines yields three equations that completely describe the
geometry of the tetrahedron,
d_{ab}^2 = a^2 + b^2 - 2ab\cos(\theta_{ab})    (4.7)

d_{ac}^2 = a^2 + c^2 - 2ac\cos(\theta_{ac})    (4.8)

d_{bc}^2 = b^2 + c^2 - 2bc\cos(\theta_{bc})    (4.9)

Everything is known except the lengths of the tetrahedron legs, a, b, and c. The image
points provide direction vectors to the model points in the camera frame,

\hat{w}_a = \frac{w_a}{\|w_a\|} , \quad \hat{w}_b = \frac{w_b}{\|w_b\|} , \quad \hat{w}_c = \frac{w_c}{\|w_c\|} .    (4.10)
Figure 4.2: Perspective Tetrahedron
The angles between the vectors may be easily computed via the dot product so that

\cos(\theta_{ab}) = \hat{w}_a \cdot \hat{w}_b , \quad \cos(\theta_{ac}) = \hat{w}_a \cdot \hat{w}_c , \quad \cos(\theta_{bc}) = \hat{w}_b \cdot \hat{w}_c .    (4.11)

Additionally, the distances of the tetrahedron base are known from the model points. The
three distances are computed as

d_{ab}^2 = \| ^{M}P_a - ^{M}P_b \|^2 , \quad d_{ac}^2 = \| ^{M}P_a - ^{M}P_c \|^2 , \quad d_{bc}^2 = \| ^{M}P_b - ^{M}P_c \|^2 .    (4.12)
Two of the leg lengths can be expressed in terms of the third leg. For example, b = x a, c =
y a. Then, the original equations are rearranged so that
d_{ab}^2 = a^2 + x^2 a^2 - 2a^2 x \cos(\theta_{ab})
d_{ac}^2 = a^2 + y^2 a^2 - 2a^2 y \cos(\theta_{ac})    (4.13)
d_{bc}^2 = x^2 a^2 + y^2 a^2 - 2a^2 xy \cos(\theta_{bc})

The three equations can be reduced to a 4th-order polynomial in either x or y. For example,

P_4 x^4 + P_3 x^3 + P_2 x^2 + P_1 x + P_0 = 0 .    (4.14)

Of the four possible roots, we are only interested in those that are positive and real. Length
a is calculated for each valid root and then b and c are found next. There are at most eight
possible solutions for the tetrahedron but only four of them can be in a valid position in
front of the camera. The estimated location of the model points in the camera frame can be
calculated from the leg lengths and direction vectors,

^{C}P_a = a\,\hat{w}_a , \quad ^{C}P_b = b\,\hat{w}_b , \quad ^{C}P_c = c\,\hat{w}_c .    (4.15)
Finally, the pose of the model with respect to the camera may be found by aligning
the two congruent triangles in 3D, figure 4.3. An analytic solution can be found in [?]. Both
triangles are first translated to the origin and then rotated to lie in the same plane with
one side along a common axis. The alignment procedure requires three transformations (1
translation, 2 rotations) for each triangle. The results and an illustration (figure 4.3) are provided
below.

^{W}_{C}H \begin{bmatrix} ^{C}P_i \\ 1 \end{bmatrix} = H_6 H_5 H_4 \begin{bmatrix} ^{C}P_i \\ 1 \end{bmatrix} = H_3 H_2 H_1 \begin{bmatrix} ^{M}P_i \\ 1 \end{bmatrix} = ^{W}_{M}H \begin{bmatrix} ^{M}P_i \\ 1 \end{bmatrix}    (4.16)

^{C}_{M}H = ^{C}_{W}H\, ^{W}_{M}H    (4.17)

4.2.2 Direct Linear Transform - Abdel-Aziz & Karara
The perspective-three-point problem has some obvious shortcomings including sensitivity to
noise and pose ambiguity. [42] provides a geometric explanation of configurations that lead to
ambiguity. These are the two primary factors which motivate the use of additional points. Four-
and five-point algorithms exist, [18], [37], but we want an algorithm that can handle an arbitrary
number of points, n ≥ 6. This is referred to as the perspective-n-point problem (PnP). The
pose ambiguity problem is resolved (under weak assumptions), but the sensitivity to image
noise remains. However, as the number of points increases, the sensitivity decreases. The
Direct Linear Transform (DLT) algorithm was first introduced by Abdel-Aziz and Karara [2].
It is a non-iterative algorithm originally intended to find a linear transformation from object
space to image space and estimate some of the camera's intrinsic parameters.

Figure 4.3: Congruent Triangle Alignment

The linear transformation is a combination of the homogeneous transform, ^{C}_{M}H, and camera matrix,
K. The transformation of a single point is written as

\begin{bmatrix} u_i - u_0 \\ v_i - v_0 \\ -d \end{bmatrix} = c_i \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} ^{M}X_i - X_0 \\ ^{M}Y_i - Y_0 \\ ^{M}Z_i - Z_0 \end{bmatrix} = c_i \begin{bmatrix} r_{11}(^{M}X_i - X_0) + r_{12}(^{M}Y_i - Y_0) + r_{13}(^{M}Z_i - Z_0) \\ r_{21}(^{M}X_i - X_0) + r_{22}(^{M}Y_i - Y_0) + r_{23}(^{M}Z_i - Z_0) \\ r_{31}(^{M}X_i - X_0) + r_{32}(^{M}Y_i - Y_0) + r_{33}(^{M}Z_i - Z_0) \end{bmatrix}    (4.18)

The scale factor, c_i, is isolated from the last line of the equation above,

c_i = \frac{ -d }{ r_{31}(^{M}X_i - X_0) + r_{32}(^{M}Y_i - Y_0) + r_{33}(^{M}Z_i - Z_0) } ,    (4.19)

and terms are introduced for focal length and aspect ratio,

f_u = \frac{-d}{\lambda_u} , \quad f_v = \frac{-d}{\lambda_v} .    (4.20)

The variables above are substituted into the original equation 4.18. Now only the top two
rows are of interest. Specifically,

u_i = f_u \frac{ r_{11}(^{M}X_i - X_0) + r_{12}(^{M}Y_i - Y_0) + r_{13}(^{M}Z_i - Z_0) }{ r_{31}(^{M}X_i - X_0) + r_{32}(^{M}Y_i - Y_0) + r_{33}(^{M}Z_i - Z_0) } - u_0    (4.21)

v_i = f_v \frac{ r_{21}(^{M}X_i - X_0) + r_{22}(^{M}Y_i - Y_0) + r_{23}(^{M}Z_i - Z_0) }{ r_{31}(^{M}X_i - X_0) + r_{32}(^{M}Y_i - Y_0) + r_{33}(^{M}Z_i - Z_0) } - v_0 .    (4.22)
Finally, the DLT parameters, l_1 \dots l_{11}, are introduced by grouping terms in the equation
above. In this form, a model point can be projected to its location in image coordinates
given the eleven parameters.

u_i = \frac{ l_1\, ^{M}X_i + l_2\, ^{M}Y_i + l_3\, ^{M}Z_i + l_4 }{ l_9\, ^{M}X_i + l_{10}\, ^{M}Y_i + l_{11}\, ^{M}Z_i + 1 }    (4.23)

v_i = \frac{ l_5\, ^{M}X_i + l_6\, ^{M}Y_i + l_7\, ^{M}Z_i + l_8 }{ l_9\, ^{M}X_i + l_{10}\, ^{M}Y_i + l_{11}\, ^{M}Z_i + 1 }    (4.24)
The equations are rearranged once more into a set of linear error equations,

u_i = l_1\, ^{M}X_i + l_2\, ^{M}Y_i + l_3\, ^{M}Z_i + l_4 - u_i ( l_9\, ^{M}X_i + l_{10}\, ^{M}Y_i + l_{11}\, ^{M}Z_i )    (4.25)

v_i = l_5\, ^{M}X_i + l_6\, ^{M}Y_i + l_7\, ^{M}Z_i + l_8 - v_i ( l_9\, ^{M}X_i + l_{10}\, ^{M}Y_i + l_{11}\, ^{M}Z_i ) ,    (4.26)

which may be written in matrix form as

M_i = \begin{bmatrix} ^{M}X_i & ^{M}Y_i & ^{M}Z_i & 1 & 0 & 0 & 0 & 0 & -u_i\, ^{M}X_i & -u_i\, ^{M}Y_i & -u_i\, ^{M}Z_i \\ 0 & 0 & 0 & 0 & ^{M}X_i & ^{M}Y_i & ^{M}Z_i & 1 & -v_i\, ^{M}X_i & -v_i\, ^{M}Y_i & -v_i\, ^{M}Z_i \end{bmatrix} .    (4.27)

Let \vec{L} equal the vector of transform coefficients,

\vec{L} = \begin{bmatrix} l_1 & \dots & l_{11} \end{bmatrix}^T    (4.28)

so that

M_i \vec{L} = \begin{bmatrix} u_i \\ v_i \end{bmatrix} .    (4.29)

A large linear system is constructed from the error equations for the entire point set,

M = \begin{bmatrix} M_1 \\ \vdots \\ M_n \end{bmatrix} , \qquad W = \begin{bmatrix} u_1 & v_1 & \dots & u_n & v_n \end{bmatrix}^T .    (4.30)

The projection theorem is used to calculate the least-squares solution of the DLT parameters.

\vec{L} = (M^T M)^{-1} M^T W    (4.31)

Note that a minimum of six points are required to solve for the eleven unknown parameters.
If many images are available and used in the same optimization, some of the camera's
intrinsic parameters may be estimated from the DLT parameters. However, for a single
image, the intrinsic parameters must be known a priori and the image points should be
normalized before constructing the M_i matrices. When calculating pose, the parameters
are “clamped” to the correct values, f_u = f_v = 1 and u_0 = v_0 = 0. A list of all the DLT
parameters and their connection to the original transformation is provided in the appendix.
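A MATLAB sketch of the single-image DLT solve (with the intrinsic parameters clamped and the image points already normalized) is given below; Pm and uv are placeholder names for the model and image point sets.

% DLT least-squares solution for the eleven parameters (eqs. 4.27-4.31)
% Pm : 3xn model points, uv : 2xn normalized image points (n >= 6)
n = size(Pm, 2);
M = zeros(2*n, 11);  W = zeros(2*n, 1);
for i = 1:n
    X = Pm(1,i);  Y = Pm(2,i);  Z = Pm(3,i);
    u = uv(1,i);  v = uv(2,i);
    M(2*i-1, :) = [X Y Z 1 0 0 0 0 -u*X -u*Y -u*Z];
    M(2*i,   :) = [0 0 0 0 X Y Z 1 -v*X -v*Y -v*Z];
    W(2*i-1) = u;  W(2*i) = v;
end
L = (M' * M) \ (M' * W);                 % DLT parameters l1 ... l11 (eq. 4.31)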
The camera orientation is calculated directly from the DLT parameters via

^{C}_{M}R = D \begin{bmatrix} \frac{u_0 l_9 - l_1}{f_u} & \frac{u_0 l_{10} - l_2}{f_u} & \frac{u_0 l_{11} - l_3}{f_u} \\ \frac{v_0 l_9 - l_5}{f_v} & \frac{v_0 l_{10} - l_6}{f_v} & \frac{v_0 l_{11} - l_7}{f_v} \\ l_9 & l_{10} & l_{11} \end{bmatrix} ,    (4.32)

where

D^2 = \frac{1}{ l_9^2 + l_{10}^2 + l_{11}^2 } .    (4.33)

It is important to recognize that no orthogonality constraint is placed on the elements of
the rotation matrix. With even relatively small amounts of noise, the matrix can be far
from orthogonal. This is a big problem! Orthogonalizing a matrix is not a trivial task. The
camera position is also calculated from the DLT parameters via

\begin{bmatrix} X_0 \\ Y_0 \\ Z_0 \end{bmatrix} = \begin{bmatrix} l_1 & l_2 & l_3 \\ l_5 & l_6 & l_7 \\ l_9 & l_{10} & l_{11} \end{bmatrix}^{-1} \begin{bmatrix} -l_4 \\ -l_8 \\ -1 \end{bmatrix} .    (4.34)

However, equation 4.18 is not consistent with the standard convention for transformation of
a point. Specifically, a point should be rotated and then translated, not the opposite order.
Therefore, the correct position requires a rotation by the matrix found above, which might
not be orthogonal.

^{C}t_M = - ^{C}_{M}R \begin{bmatrix} X_0 \\ Y_0 \\ Z_0 \end{bmatrix}    (4.35)
4.2.3 “Absolute Orientation” - Horn
The problem of aligning a set of vectors expressed in two different coordinate systems has
been given the name “Absolute Orientation”. This is extremely useful for many different
tasks, such as aligning gravity vectors in the IMU calibration procedure discussed in chapter
three. Another example was seen in the last step of the P3P problem, alignment of congruent triangles. Notice that the triangle alignment problem is slightly different because
translation is included as well. For many computer vision algorithms absolute orientation
is a fundamental step. A set of 3D points are first estimated in the camera frame and then
aligned with the same point set expressed in the model. Of course, the task is complicated
by noise; the size and shape of the two point clouds might be slightly different.
Two popular methods were introduced by Horn et al [19],[20] that produce similar
results. They differ in the way that rotation is calculated. Both separate the calculation of
rotation and translation into two steps and require calculation of the centroid of both point
clouds. The central idea is that the vectors created from the centroid to all of the points
in the set provide directional information. In a noiseless system, the centroid or any of the
points in the set could be used, but with noise added the centroid has been shown to be the
optimal choice. Calculation of both centroids is straightforward and relatively fast.

^{M}\bar{P} = \frac{1}{n} \sum_{i=1}^{n} ^{M}P_i , \qquad ^{C}\bar{P} = \frac{1}{n} \sum_{i=1}^{n} ^{C}P_i    (4.36)

However, if absolute orientation is to be performed repeatedly, such as within an iterative
algorithm, to save time it is logical to store the value of the centroid in the model frame so
that it is only calculated once. Translation may be calculated directly from the rotation and
both centroids.

^{C}t_M = ^{C}\bar{P} - ^{C}_{M}R\, ^{M}\bar{P}    (4.37)
4.2.3.1 Optimal Rotation - SVD
The SVD method is somewhat intuitive. The principal directions of a moment matrix,

M = \sum_{i=1}^{n} \left( ^{C}P_i - ^{C}\bar{P} \right) \left( ^{M}P_i - ^{M}\bar{P} \right)^T ,    (4.38)

are found through singular value decomposition,

\{U, S, V\} = svd(M) ,    (4.39)

and then the (orthogonal) rotation is recovered via

^{C}_{M}R = U V^T .    (4.40)
4.2.3.2 Optimal Rotation - Quaternion
A problem can occur in the SVD method when a significant level of noise exists. As a
result the orthogonal matrix U V^T may have a determinant equal to -1, which indicates
it is not special orthogonal (chapter 2). [40] provides a way to detect and fix the problem.
The quaternion method always produces a proper rotation matrix, but it is slightly more
expensive. A 4-dimensional moment matrix is constructed from multiplication of right and
left sided skew symmetric matrices,

M = \sum_{i=1}^{n} \left( Q_L \begin{bmatrix} 1 \\ ^{C}P_i - ^{C}\bar{P} \end{bmatrix}^{*} \right) \left( Q_R \begin{bmatrix} 1 \\ ^{M}P_i - ^{M}\bar{P} \end{bmatrix} \right) ,    (4.41)

and then decomposed into a set of eigenvectors and eigenvalues,

V = eig(M) .    (4.42)

The eigenvector with largest eigenvalue is the quaternion representing the rotation,

^{C}_{M}R = R(V_{*4}) .    (4.43)
4.2.4 Orthogonal Iteration Algorithm - Lu & Hager
The Orthogonal Iteration Algorithm [25] is based on the absolute orientation concept and the
projection matrix, F , of equation 4.5. The algorithm is fast and globally convergent (under
weak assumptions). Lu and Hager introduced the idea of applying the projection matrix
to a computer vision algorithm, but even more novel is the proof of convergence within a
derivative-free minimization of the object space residual of equation 4.2. The algorithm is
very simple and operates by alternating between optimal translation and optimal rotation
calculations.
After expanding the object space error function of 4.2,

E = \sum_{i=1}^{n} (F_i - I)\, ^{C}_{M}R\, ^{M}P_i + \sum_{i=1}^{n} (F_i - I)\, ^{C}t_M ,    (4.44)

derivation of the optimal translation is obvious.

\sum_{i=1}^{n} ^{C}t_M - \sum_{i=1}^{n} F_i\, ^{C}t_M = \sum_{i=1}^{n} (F_i - I)\, ^{C}_{M}R\, ^{M}P_i

n\, ^{C}t_M - \left( \sum_{i=1}^{n} F_i \right) ^{C}t_M = \sum_{i=1}^{n} (F_i - I)\, ^{C}_{M}R\, ^{M}P_i    (4.45)

n \left( I - \frac{1}{n} \sum_{i=1}^{n} F_i \right) ^{C}t_M = \sum_{i=1}^{n} (F_i - I)\, ^{C}_{M}R\, ^{M}P_i

^{C}t_M(R) = \frac{1}{n} \left( I - \frac{1}{n} \sum_{i=1}^{n} F_i \right)^{-1} \sum_{i=1}^{n} (F_i - I)\, ^{C}_{M}R\, ^{M}P_i    (4.46)

This leads to an estimate of the 3D camera point locations which uses the projection matrix,

^{C}P_i(t) = F_i \left( ^{C}_{M}R\, ^{M}P_i + ^{C}t_M \right) , \qquad ^{C}\bar{P}(t) = \frac{1}{n} \sum_{i=1}^{n} ^{C}P_i(t) .    (4.47)

The optimal rotation can be found using either of the two absolute orientation methods
above. The SVD method is more intuitive and is provided below for clarity.

M(t) = \sum_{i=1}^{n} \left( ^{C}P_i(t) - ^{C}\bar{P}(t) \right) \left( ^{M}P_i - ^{M}\bar{P} \right)^T    (4.48)

\{U, S, V\} = svd(M(t)) , \qquad ^{C}_{M}R = U V^T    (4.49)
The authors use the “Global Convergence Theorem” to show that a solution will
eventually be reached from an arbitrary starting point. Although global convergence does
not guarantee that the true pose will always be recovered, it does suggest that the true pose
can be reached from a very broad range of initial guesses.
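The following MATLAB sketch shows the basic alternation of the OIA, under the assumption that the projection matrices F_i have already been built from the normalized image points and that an initial rotation R is available. The names and the fixed iteration count are illustrative.

% Orthogonal Iteration Algorithm (a sketch of eqs. 4.44-4.49)
% F : 3x3xn projection matrices F_i, Pm : 3xn model points,
% R : initial rotation guess, niter : number of iterations
n = size(Pm, 2);
Fbar = mean(F, 3);
A = inv(eye(3) - Fbar) / n;              % (I - (1/n) sum F_i)^(-1) / n, used in eq. 4.46
Pm_bar = mean(Pm, 2);
for k = 1:niter
    % optimal translation for the current rotation (eq. 4.46)
    s = zeros(3, 1);
    for i = 1:n, s = s + (F(:,:,i) - eye(3)) * (R * Pm(:,i)); end
    t = A * s;
    % re-project the model points into the camera (eq. 4.47)
    Pc = zeros(3, n);
    for i = 1:n, Pc(:,i) = F(:,:,i) * (R * Pm(:,i) + t); end
    Pc_bar = mean(Pc, 2);
    % optimal rotation by SVD absolute orientation (eqs. 4.48, 4.49)
    M = (Pc - Pc_bar * ones(1, n)) * (Pm - Pm_bar * ones(1, n))';
    [U, S, V] = svd(M);
    R = U * V';
end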
4.3 Experimental Algorithms
Several experimental algorithms have been created as alternatives to the algorithms presented previously. The algorithms have a linear non-iterative stage followed by a nonlinear
iterative stage for refinement. Additionally, the algorithms operate by creating a null space
in which the pose parameterization lives. For example, the image space error equations 4.1
are rearranged to form a null space for points in the camera,

e^{i}_{u} = f_u\, ^{C}X_i + (u_0 - u_i)\, ^{C}Z_i    (4.50)

e^{i}_{v} = f_v\, ^{C}Y_i + (v_0 - v_i)\, ^{C}Z_i .    (4.51)

Similarly, the object space error equations 4.2 are rearranged to form a null space for points
in the camera,

e^{i}_{x} = ^{C}X_i - f_{1*}\, ^{C}P_i = (f_{11} - 1)\, ^{C}X_i + f_{12}\, ^{C}Y_i + f_{13}\, ^{C}Z_i    (4.52)

e^{i}_{y} = ^{C}Y_i - f_{2*}\, ^{C}P_i = f_{21}\, ^{C}X_i + (f_{22} - 1)\, ^{C}Y_i + f_{23}\, ^{C}Z_i    (4.53)

e^{i}_{z} = ^{C}Z_i - f_{3*}\, ^{C}P_i = f_{31}\, ^{C}X_i + f_{32}\, ^{C}Y_i + (f_{33} - 1)\, ^{C}Z_i .    (4.54)

4.3.1 Modified DLT + Gauss-Newton

The Modified Direct Linear Transform (MDLT) algorithm has been created to overcome some
of the shortcomings of the DLT method presented earlier. Unlike the DLT, the algorithm is
designed to find pose only and not to estimate any of the camera's intrinsic parameters. The
algorithm consists of two stages: form an initial guess and then iterate using Gauss-Newton.
4.3.1.1 Linear Transformation
The initial guess is found by forming a linear system and relaxing the rotation constraints
of the pose. The unconstrained transform will simply be referred to as a linear transform,
L, so that

\begin{bmatrix} ^{C}P_* \\ 1 \end{bmatrix} = L \begin{bmatrix} ^{M}P_* \\ 1 \end{bmatrix} .    (4.55)

The ordering of the individual coefficients of the linear transform, l_1 \dots l_{12}, is shown through
the transformation of a single point,

\begin{bmatrix} ^{C}X_i \\ ^{C}Y_i \\ ^{C}Z_i \\ 1 \end{bmatrix} = \begin{bmatrix} l_1 & l_2 & l_3 & l_4 \\ l_5 & l_6 & l_7 & l_8 \\ l_9 & l_{10} & l_{11} & l_{12} \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} ^{M}X_i \\ ^{M}Y_i \\ ^{M}Z_i \\ 1 \end{bmatrix} .    (4.56)

The equation is simply expanded to show how model points in the camera frame are expressed
in terms of the linear transform,

^{C}X_i = l_1\, ^{M}X_i + l_2\, ^{M}Y_i + l_3\, ^{M}Z_i + l_4    (4.57)

^{C}Y_i = l_5\, ^{M}X_i + l_6\, ^{M}Y_i + l_7\, ^{M}Z_i + l_8    (4.58)

^{C}Z_i = l_9\, ^{M}X_i + l_{10}\, ^{M}Y_i + l_{11}\, ^{M}Z_i + l_{12} .    (4.59)

4.3.1.2 Image Space System
The linear transform coefficients are substituted into the null space equations of 4.50 which
yields

e^{i}_{u} = f_u ( l_1\, ^{M}X_i + l_2\, ^{M}Y_i + l_3\, ^{M}Z_i + l_4 ) + (u_0 - u_i)( l_9\, ^{M}X_i + l_{10}\, ^{M}Y_i + l_{11}\, ^{M}Z_i + l_{12} )    (4.60)

e^{i}_{v} = f_v ( l_5\, ^{M}X_i + l_6\, ^{M}Y_i + l_7\, ^{M}Z_i + l_8 ) + (v_0 - v_i)( l_9\, ^{M}X_i + l_{10}\, ^{M}Y_i + l_{11}\, ^{M}Z_i + l_{12} ) .    (4.61)

Notice that the equations are linear in the coefficients and may therefore be written in matrix
form,

M_i = \begin{bmatrix} M_{ai} & 0 & M_{bi} \\ 0 & M_{ci} & M_{di} \end{bmatrix} ,    (4.62)

where

M_{ai} = \begin{bmatrix} f_u\, ^{M}X_i & f_u\, ^{M}Y_i & f_u\, ^{M}Z_i & f_u \end{bmatrix}    (4.63)

M_{bi} = \begin{bmatrix} (u_0 - u_i)\, ^{M}X_i & (u_0 - u_i)\, ^{M}Y_i & (u_0 - u_i)\, ^{M}Z_i & (u_0 - u_i) \end{bmatrix}    (4.64)

M_{ci} = \begin{bmatrix} f_v\, ^{M}X_i & f_v\, ^{M}Y_i & f_v\, ^{M}Z_i & f_v \end{bmatrix}    (4.65)

and

M_{di} = \begin{bmatrix} (v_0 - v_i)\, ^{M}X_i & (v_0 - v_i)\, ^{M}Y_i & (v_0 - v_i)\, ^{M}Z_i & (v_0 - v_i) \end{bmatrix} .    (4.66)

The equations for the entire point set are used to construct a large matrix referred to as the
“measurement matrix”. Let \vec{L} equal the vector of linear transform coefficients,

\vec{L} = \begin{bmatrix} l_1 & \dots & l_{12} \end{bmatrix}^T    (4.67)

so that

M_i \vec{L} = \begin{bmatrix} e^{i}_{u} \\ e^{i}_{v} \end{bmatrix} , \qquad M = \begin{bmatrix} M_1 \\ \vdots \\ M_n \end{bmatrix} .    (4.68)

The large matrix M creates a null space for the linear transform coefficients so that

V = \{ \vec{L} \in R^{12} \mid M \vec{L} = 0 \} .    (4.69)

4.3.1.3 Object Space System
For several reasons it might be desirable to also have a measurement matrix constructed
in object space. First, we would like to be able to test the convergence properties of the
two error spaces. Maybe the Orthogonal Iteration Algorithm converges well because it
minimizes an object space residual. Secondly, it is logical that the initial guess into an
iterative algorithm be from a minimization that occurred in the same error space. The linear
transform coefficients are substituted into the null space equations of 4.52 which yields

e^{i}_{x} = (f_{11} - 1)( l_1\, ^{M}X_i + l_2\, ^{M}Y_i + l_3\, ^{M}Z_i + l_4 ) + f_{12}( l_5\, ^{M}X_i + l_6\, ^{M}Y_i + l_7\, ^{M}Z_i + l_8 ) + f_{13}( l_9\, ^{M}X_i + l_{10}\, ^{M}Y_i + l_{11}\, ^{M}Z_i + l_{12} )    (4.70)

e^{i}_{y} = f_{21}( l_1\, ^{M}X_i + l_2\, ^{M}Y_i + l_3\, ^{M}Z_i + l_4 ) + (f_{22} - 1)( l_5\, ^{M}X_i + l_6\, ^{M}Y_i + l_7\, ^{M}Z_i + l_8 ) + f_{23}( l_9\, ^{M}X_i + l_{10}\, ^{M}Y_i + l_{11}\, ^{M}Z_i + l_{12} )    (4.71)

e^{i}_{z} = f_{31}( l_1\, ^{M}X_i + l_2\, ^{M}Y_i + l_3\, ^{M}Z_i + l_4 ) + f_{32}( l_5\, ^{M}X_i + l_6\, ^{M}Y_i + l_7\, ^{M}Z_i + l_8 ) + (f_{33} - 1)( l_9\, ^{M}X_i + l_{10}\, ^{M}Y_i + l_{11}\, ^{M}Z_i + l_{12} ) .    (4.72)

Therefore, the object space measurement matrix has 3n rows whereas the image space matrix
has 2n rows.
4.3.1.4 Null Space Solution
Regardless of which error space is used, the null space solution may be computed by eigenvalue decomposition,

V = eig(M^T M) .    (4.73)

The multiplication M^T M can be a rather expensive operation when the number of points
is very large. If noise is present, there will not be a strictly zero eigenvalue. Instead, we are
only interested in the eigenvector, V_{*1}, with the smallest eigenvalue so that

\tilde{L} = reshape(V_{*1}, 3, 4) .    (4.74)

It should be obvious that an arbitrary (nonzero) scaling exists because scalar multiplication
of any vector within the null space still lies in the null space, i.e. \| M(\beta V_{*1}) \|^2 = \beta^2 \| M V_{*1} \|^2 = 0.
However, in this context, the scale factor \beta is not arbitrary and must be
determined for the algorithm to work properly. Specifically, the scale is needed to compute
an estimate of the model points in the camera frame within

^{C}P_* = \beta\, \tilde{L} \begin{bmatrix} ^{M}P_* \\ 1 \end{bmatrix} = \beta\, ^{C}S_* ,    (4.75)

because both point clouds should be approximately the same shape and size. Fortunately, the
scale may be recovered easily, in a least-squares sense. Let ^{C}S_* equal the set of improperly
scaled model points in the camera frame. The centroid of each set is computed by

^{C}\bar{S} = \frac{1}{n} \sum_{i=1}^{n} ^{C}S_i , \qquad ^{M}\bar{P} = \frac{1}{n} \sum_{i=1}^{n} ^{M}P_i ,    (4.76)

and then the sum of squared distances about the centroid is calculated via

^{C}d = \sum_{i=1}^{n} \left\| ^{C}S_i - ^{C}\bar{S} \right\|^2 , \qquad ^{M}d = \sum_{i=1}^{n} \left\| ^{M}P_i - ^{M}\bar{P} \right\|^2 .    (4.77)

The scale factor is the square root of the distance ratio,

\beta = sgn(^{C}\bar{Z}) \sqrt{ ^{M}d / ^{C}d } ,    (4.78)

where the sgn(^{C}\bar{Z}) is needed to correct for camera points that might lie behind the camera,
in the negative Z direction instead of positive.
After the scale factor is recovered, the absolute orientation concept is applied to find
the relative transformation between the point sets,

\begin{bmatrix} ^{C}P_* \\ 1 \end{bmatrix} = ^{C}_{M}H \begin{bmatrix} ^{M}P_* \\ 1 \end{bmatrix} .    (4.79)

Note, in this way the resulting rotation is forced to be orthogonal, unlike in the standard
DLT method.
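A compact MATLAB sketch of the linear (image space) stage is given below; the absolute orientation step that recovers R and t from the scaled points is omitted since it was covered in section 4.2.3. Variable names and the internal parameter ordering are placeholders.

% Linear stage of the MDLT (image space system, eqs. 4.60-4.69 and 4.74-4.79)
% Pm : 3xn model points, uv : 2xn image points, fu, fv, u0, v0 : intrinsics
n = size(Pm, 2);
M = zeros(2*n, 12);
for i = 1:n
    X = [Pm(:,i)' 1];
    M(2*i-1, :) = [fu*X, zeros(1,4), (u0 - uv(1,i))*X];
    M(2*i,   :) = [zeros(1,4), fv*X, (v0 - uv(2,i))*X];
end
[V, D] = eig(M' * M);
[~, j] = min(diag(D));                   % eigenvector with the smallest eigenvalue
Ltil = reshape(V(:, j), 4, 3)';          % 3x4 linear transform (eq. 4.74)
S = Ltil * [Pm; ones(1, n)];             % improperly scaled camera points (eq. 4.75)

% scale recovery (eqs. 4.76-4.78)
dS = sum(sum((S  - mean(S, 2)  * ones(1, n)).^2));
dP = sum(sum((Pm - mean(Pm, 2) * ones(1, n)).^2));
beta = sign(mean(S(3,:))) * sqrt(dP / dS);
Pc = beta * S;                           % estimated model points in the camera frame
% a final absolute orientation step (section 4.2.3) then yields R and t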
4.3.1.5 Final Optimization
The pose solution above provides a good initial guess that was found non-iteratively in a
linear fashion. The initial guess could be used in any iterative algorithm, but it makes sense
to use it within the framework that has already been developed.
Recall that the measurement matrix, M, will be of size 2n × 12 or 3n × 12. The matrix can
actually be decomposed into a reduced size of 12 × 12. From the SVD, M = U S V^T, we obtain

M^T M = (V S U^T)(U S V^T) = (S V^T)^T (S V^T) = \hat{M}^T \hat{M} .    (4.80)

This is a very significant reduction if the number of points is large. QR decomposition is perhaps a quicker way to compute the reduced measurement matrix, \hat{M}. It can be shown, [14],
that minimizing a residual with the reduced measurement matrix is the same as minimizing
a residual with the full-size matrix,

\min_{\vec{L}} \| \hat{M} \vec{L} \| = \min_{\vec{L}} \| M \vec{L} \| .    (4.81)

Sub-optimality of the non-iterative linear solution might be attributed in part to the
relaxation of the orthogonality constraint. Specifically, the linear transform is allowed to have
12-DOF. Fortunately, it is easy to restrict the degrees of freedom in an iterative algorithm
by parameterizing the pose (the linear transform above) as covered in chapter two,

L(\Theta) = \begin{bmatrix} R(\Theta) & t(\Theta) \\ 0 & 1 \end{bmatrix} .    (4.82)

For example, Euler angles and three translation elements might be used so that

\Theta = \begin{bmatrix} \alpha & \beta & \gamma & t_x & t_y & t_z \end{bmatrix} .    (4.83)

Now the transform parameters, l_1 \dots l_{12}, only have 6-DOF,

\vec{L}(\Theta) = \begin{bmatrix} l_1 & \dots & l_{12} \end{bmatrix}^T .    (4.84)

A standard Gauss-Newton algorithm may be used and will have good convergence
properties if the initial guess is in the neighborhood of the global minimum.

f(\Theta) = \hat{M} \vec{L}(\Theta)    (4.85)

J_f(\Theta) = \begin{bmatrix} \partial f_1(\Theta)/\partial\Theta_1 & \dots & \partial f_1(\Theta)/\partial\Theta_6 \\ \vdots & \ddots & \vdots \\ \partial f_{12}(\Theta)/\partial\Theta_1 & \dots & \partial f_{12}(\Theta)/\partial\Theta_6 \end{bmatrix}    (4.86)

\Theta_{k+1} = \Theta_k - \left( J_f(\Theta_k)^T J_f(\Theta_k) \right)^{-1} J_f(\Theta_k)^T f(\Theta_k)    (4.87)

Note, by using the reduced measurement matrix, the Jacobian computed each iteration will
always be of size 12 × 6 no matter how many points are in the full set. A significant time
reduction occurs if many points and iterations are used.
4.3.2 3-Simplex + Gauss-Newton
The idea of using a 3-simplex in a computer vision algorithm was first introduced by Moreno-Noguer et al. [27] in 2007. Many people are not familiar with the simplex, and its connection
to a computer vision algorithm was novel. A quick overview of the 3-simplex is covered to
provide a foundation for the algorithm to follow.
4.3.2.1 3-Simplex and Barycentric Coordinates
The previous notation for model points is modified slightly,

^{M}P^{i}_{m} = \begin{bmatrix} ^{M}X^{i}_{m} & ^{M}Y^{i}_{m} & ^{M}Z^{i}_{m} \end{bmatrix}^T , \qquad ^{M}P^{*}_{m} = \begin{bmatrix} ^{M}P^{1}_{m} & \dots & ^{M}P^{n}_{m} \end{bmatrix} ,    (4.88)

by attaching the “m” subscript to distinguish it from the vertices of the 3-simplex, “s”,

^{M}P^{i}_{s} = \begin{bmatrix} ^{M}X^{i}_{s} & ^{M}Y^{i}_{s} & ^{M}Z^{i}_{s} \end{bmatrix}^T , \qquad ^{M}P^{*}_{s} = \begin{bmatrix} ^{M}P^{1}_{s} & ^{M}P^{2}_{s} & ^{M}P^{3}_{s} & ^{M}P^{4}_{s} \end{bmatrix} .    (4.89)

Barycentric coordinates describe the location of points with respect to the vertices of the
simplex,

\begin{bmatrix} ^{M}P^{*}_{m} \\ 1 \end{bmatrix} = \begin{bmatrix} ^{M}P^{*}_{s} \\ 1 \end{bmatrix} \alpha .    (4.90)

They may be quickly computed in a single matrix operation,

\alpha = \begin{bmatrix} ^{M}P^{*}_{s} \\ 1 \end{bmatrix}^{-1} \begin{bmatrix} ^{M}P^{*}_{m} \\ 1 \end{bmatrix} .    (4.91)

Vertices that are coplanar create a singular matrix and should therefore not be used. When
the average of the barycentric coordinates is applied to the simplex vertices, the centroid (or
barycenter) of the point set is found.
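In MATLAB the computation of equation 4.91 is a single backslash solve; Ps and Pm below are placeholder names for the 3x4 simplex vertices and the 3xn model points.

% Barycentric coordinates of the model points with respect to the simplex vertices (eq. 4.91)
alpha = [Ps; ones(1, 4)] \ [Pm; ones(1, size(Pm, 2))];   % 4xn coordinates
% each column sums to one (eq. 4.97), and [Ps; 1]*alpha reproduces [Pm; 1]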
The utility of the simplex in the context of a computer vision algorithm is that the
relationship between vertices, points, and barycentric coordinates is rotation and translation
invariant, as the following proof suggests.

\begin{bmatrix} ^{C}P^{*}_{m} \\ 1 \end{bmatrix} = ^{C}_{M}H \begin{bmatrix} ^{M}P^{*}_{m} \\ 1 \end{bmatrix} = ^{C}_{M}H \begin{bmatrix} ^{M}P^{*}_{s} \\ 1 \end{bmatrix} \alpha = \begin{bmatrix} ^{C}P^{*}_{s} \\ 1 \end{bmatrix} \alpha    (4.92)

This expression shows the relationship of the barycentric coordinates and simplex
vertices to a single point,

\begin{bmatrix} ^{C}X^{i}_{m} \\ ^{C}Y^{i}_{m} \\ ^{C}Z^{i}_{m} \\ 1 \end{bmatrix} = \begin{bmatrix} ^{C}X^{1}_{s} & ^{C}X^{2}_{s} & ^{C}X^{3}_{s} & ^{C}X^{4}_{s} \\ ^{C}Y^{1}_{s} & ^{C}Y^{2}_{s} & ^{C}Y^{3}_{s} & ^{C}Y^{4}_{s} \\ ^{C}Z^{1}_{s} & ^{C}Z^{2}_{s} & ^{C}Z^{3}_{s} & ^{C}Z^{4}_{s} \\ 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \alpha_{i1} \\ \alpha_{i2} \\ \alpha_{i3} \\ \alpha_{i4} \end{bmatrix} .    (4.93)

The expression is simply expanded and regrouped into sums so that

^{C}X^{i}_{m} = \alpha_{i1}\, ^{C}X^{1}_{s} + \alpha_{i2}\, ^{C}X^{2}_{s} + \alpha_{i3}\, ^{C}X^{3}_{s} + \alpha_{i4}\, ^{C}X^{4}_{s} = \sum_{j=1}^{4} \alpha_{ij}\, ^{C}X^{j}_{s}    (4.94)

^{C}Y^{i}_{m} = \alpha_{i1}\, ^{C}Y^{1}_{s} + \alpha_{i2}\, ^{C}Y^{2}_{s} + \alpha_{i3}\, ^{C}Y^{3}_{s} + \alpha_{i4}\, ^{C}Y^{4}_{s} = \sum_{j=1}^{4} \alpha_{ij}\, ^{C}Y^{j}_{s}    (4.95)

^{C}Z^{i}_{m} = \alpha_{i1}\, ^{C}Z^{1}_{s} + \alpha_{i2}\, ^{C}Z^{2}_{s} + \alpha_{i3}\, ^{C}Z^{3}_{s} + \alpha_{i4}\, ^{C}Z^{4}_{s} = \sum_{j=1}^{4} \alpha_{ij}\, ^{C}Z^{j}_{s} .    (4.96)

Therefore, what may have seemed like a redundancy before is actually a constraint, i.e. the
sum of each barycentric coordinate set must equal +1.

\alpha_{i1} + \alpha_{i2} + \alpha_{i3} + \alpha_{i4} = \sum_{j=1}^{4} \alpha_{ij} = 1    (4.97)
In the MDLT method the coefficients of a linear transform are estimated, but in this algorithm, the vertices of the simplex in the camera frame are estimated. The equations above
show how the model points may be expressed in terms of the simplex vertices and barycentric
coordinates.
4.3.2.2 Measurement Matrices
The projection null space equations 4.50 and 4.52 are used again so that a linear system is
created to minimize a residual in either image or object space. Like in the MDLT method, a
measurement matrix, M , is constructed from smaller Mi matrices. Let CP~s equal the vector
of simplex vertices in the camera,

^{C}\vec{P}_s = \begin{bmatrix} ^{C}X^{1}_{s} & \dots & ^{C}Z^{4}_{s} \end{bmatrix}^T .    (4.98)

This time the vector of simplex vertices should lie in the null space,

V = \{ ^{C}\vec{P}_s \in R^{12} \mid M\, ^{C}\vec{P}_s = 0 \} .    (4.99)

The simplex vertices and barycentric coordinates are substituted into the null space equations
of 4.50 which yields

e^{i}_{u} = f_u \sum_{j=1}^{4} \alpha_{ij}\, ^{C}X^{j}_{s} + (u_0 - u_i) \sum_{j=1}^{4} \alpha_{ij}\, ^{C}Z^{j}_{s} = \sum_{j=1}^{4} \left[ f_u \alpha_{ij}\, ^{C}X^{j}_{s} + (u_0 - u_i)\alpha_{ij}\, ^{C}Z^{j}_{s} \right]    (4.100)

e^{i}_{v} = f_v \sum_{j=1}^{4} \alpha_{ij}\, ^{C}Y^{j}_{s} + (v_0 - v_i) \sum_{j=1}^{4} \alpha_{ij}\, ^{C}Z^{j}_{s} = \sum_{j=1}^{4} \left[ f_v \alpha_{ij}\, ^{C}Y^{j}_{s} + (v_0 - v_i)\alpha_{ij}\, ^{C}Z^{j}_{s} \right] .    (4.101)

The equations are linear in the vertices and may be written in the form of a measurement
matrix

M_i = \begin{bmatrix} f_u \alpha_{i1} & 0 & (u_0 - u_i)\alpha_{i1} & \dots & f_u \alpha_{i4} & 0 & (u_0 - u_i)\alpha_{i4} \\ 0 & f_v \alpha_{i1} & (v_0 - v_i)\alpha_{i1} & \dots & 0 & f_v \alpha_{i4} & (v_0 - v_i)\alpha_{i4} \end{bmatrix}    (4.102)

so that

M_i\, ^{C}\vec{P}_s = \begin{bmatrix} e^{i}_{u} \\ e^{i}_{v} \end{bmatrix} .    (4.103)

The simplex vertices and barycentric coordinates are substituted into the null space equations
of 4.52 which yields

e^{i}_{x} = (f_{11} - 1) \sum_{j=1}^{4} \alpha_{ij}\, ^{C}X^{j}_{s} + f_{12} \sum_{j=1}^{4} \alpha_{ij}\, ^{C}Y^{j}_{s} + f_{13} \sum_{j=1}^{4} \alpha_{ij}\, ^{C}Z^{j}_{s} = \sum_{j=1}^{4} \left[ (f_{11} - 1)\alpha_{ij}\, ^{C}X^{j}_{s} + f_{12}\alpha_{ij}\, ^{C}Y^{j}_{s} + f_{13}\alpha_{ij}\, ^{C}Z^{j}_{s} \right]    (4.104)

e^{i}_{y} = f_{21} \sum_{j=1}^{4} \alpha_{ij}\, ^{C}X^{j}_{s} + (f_{22} - 1) \sum_{j=1}^{4} \alpha_{ij}\, ^{C}Y^{j}_{s} + f_{23} \sum_{j=1}^{4} \alpha_{ij}\, ^{C}Z^{j}_{s} = \sum_{j=1}^{4} \left[ f_{21}\alpha_{ij}\, ^{C}X^{j}_{s} + (f_{22} - 1)\alpha_{ij}\, ^{C}Y^{j}_{s} + f_{23}\alpha_{ij}\, ^{C}Z^{j}_{s} \right]    (4.105)

e^{i}_{z} = f_{31} \sum_{j=1}^{4} \alpha_{ij}\, ^{C}X^{j}_{s} + f_{32} \sum_{j=1}^{4} \alpha_{ij}\, ^{C}Y^{j}_{s} + (f_{33} - 1) \sum_{j=1}^{4} \alpha_{ij}\, ^{C}Z^{j}_{s} = \sum_{j=1}^{4} \left[ f_{31}\alpha_{ij}\, ^{C}X^{j}_{s} + f_{32}\alpha_{ij}\, ^{C}Y^{j}_{s} + (f_{33} - 1)\alpha_{ij}\, ^{C}Z^{j}_{s} \right] .    (4.106)

Therefore, the object space measurement matrix has 3n rows whereas the image space matrix
has 2n rows.
4.3.2.3 Null Space Solution
The projection null space is found in the same way as in the MDLT algorithm, V =
eig(\hat{M}^T \hat{M}), and the single eigenvector result is very similar,

^{C}\tilde{P}^{*}_{s} = reshape(V_{*1}, 3, 4) .    (4.107)

Again, an unknown scale factor, \beta, exists within

^{C}P^{*}_{m} = \beta\, ^{C}\tilde{P}^{*}_{s}\, \alpha = \beta\, ^{C}S^{*}_{m} .    (4.108)

The scale factor is recovered like in the MDLT, and an estimate of the model points in the camera
frame is produced. Finally, the relative transformation is computed using the absolute
orientation method.
4.3.2.4 Final Optimization
The linear non-iterative solution for the simplex vertices suffers from the same problem as
the MDLT; the simplex vertices are given 12-DOF even though they really only have six.
For example, distances between all vertex pairs should be the same in the model and camera
frames. Six constraints are placed on twelve parameters leaving only 6-DOF.
Instead of using an arbitrary simplex it is logical to choose a specific shape, such as a
unit length right-handed orthogonal triad. One possible method of constructing such a triad
is by decomposing the model point set. The triad origin is placed at the centroid,

^{M}\bar{P}_m = \frac{1}{n} \sum_{i=1}^{n} ^{M}P^{i}_{m} ,    (4.109)

and the legs are made to point in the principal directions via

\{U, S, V\} = svd\left( ^{M}P^{*T}_{m} - \mathbf{1}\, ^{M}\bar{P}^{T}_{m} \right)    (4.110)

^{M}_{S}R = \begin{bmatrix} V_1 & V_2 & \pm V_3 \end{bmatrix} = \begin{bmatrix} ^{M}\hat{X}_s & ^{M}\hat{Y}_s & ^{M}\hat{Z}_s \end{bmatrix}    (4.111)

such that det(R) = +1. If chosen like this, another reference frame, {S}, has been effectively
created,

^{M}_{S}H = \begin{bmatrix} ^{M}_{S}R & ^{M}\bar{P}_m \\ 0 & 1 \end{bmatrix} .    (4.112)

More analysis is needed to determine if this is the optimal selection of the simplex.
Four points are required to define an orthogonal triad which can move freely in space.
One point defines the position and the other three define the directions along the legs. The
following expression shows how the rotation and translation of the homogeneous transform
above can be used to construct an orthogonal triad.

^{M}P^{*}_{s} = \begin{bmatrix} ^{M}\bar{P}_m & ^{M}\bar{P}_m + ^{M}\hat{X}_s & ^{M}\bar{P}_m + ^{M}\hat{Y}_s & ^{M}\bar{P}_m + ^{M}\hat{Z}_s \end{bmatrix}    (4.113)

Therefore, the orthogonal triad has the following properties when expressed in any reference
frame.

\left( P^{2}_{s} - P^{1}_{s} \right) \cdot \left( P^{3}_{s} - P^{1}_{s} \right) = \left( P^{2}_{s} - P^{1}_{s} \right) \cdot \left( P^{4}_{s} - P^{1}_{s} \right) = \left( P^{3}_{s} - P^{1}_{s} \right) \cdot \left( P^{4}_{s} - P^{1}_{s} \right) = 0    (4.114)

\left\| P^{2}_{s} - P^{1}_{s} \right\| = \left\| P^{3}_{s} - P^{1}_{s} \right\| = \left\| P^{4}_{s} - P^{1}_{s} \right\| = 1    (4.115)
Euler angles may be used to parameterize the rotational components of the triad.

R(\alpha, \beta, \gamma) = ^{C}_{S}R    (4.116)

^{C}P^{1}_{s} = \begin{bmatrix} t_x & t_y & t_z \end{bmatrix}^T    (4.117)

\Theta = \begin{bmatrix} \alpha & \beta & \gamma & t_x & t_y & t_z \end{bmatrix}^T    (4.118)

The Gauss-Newton algorithm for the 3-Simplex is identical in form to the MDLT.
Of course, the measurement matrix and parameterization are different but the number of
degrees of freedom is the same.

f(\Theta) = \hat{M}\, ^{C}\vec{P}_s(\Theta)    (4.119)

If a unit-length orthogonal triad is used, then the final pose solution may be quickly computed.
If it is not, absolute orientation will likely need to be performed.

^{C}_{M}H = ^{C}_{S}H\, ^{S}_{M}H = \begin{bmatrix} ^{C}_{S}R & ^{C}P^{1}_{s} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} ^{M}_{S}R & ^{M}P^{1}_{s} \\ 0 & 1 \end{bmatrix}^{-1}    (4.120)

4.3.3 3-Simplex SDP
The linear (non-iterative) solutions of the MDLT and 3-simplex typically provide an initial
guess which is good enough for an iterative algorithm to converge properly to the global minimum. However, as noise levels increase, this might not be a valid assumption. The algorithm
presented here is a technique to produce a more accurate result from the same framework.
However, it is iterative and inevitably slower than the linear solutions.
The algorithm uses the 3-simplex, multiple eigenvectors, and a polynomial semidefinite
program (SDP). The linear transform of the MDLT is not applicable because of the way the
degrees of freedom are spread throughout the twelve parameters. Specifically, six constraints
are placed on the nine rotational parameters while the three translational parameters are left
somewhat disconnected or unconstrained. The 3-simplex is different because six constraints
are placed over all twelve parameters together so that rotation and translation are tightly
coupled.
Recall from earlier that only the eigenvector from V = eig(\hat{M}^T \hat{M}) with the smallest
eigenvalue was of interest. Only in a noiseless system will there be a perfectly zero eigenvalue.
However, with noise, all of the eigenvectors provide “some” information in decreasing order
of significance as the size of the eigenvalue increases. This algorithm attempts to use the
first two, three, or four eigenvectors,

v_a = reshape(V_{*1}, 3, 4) , \quad v_b = reshape(V_{*2}, 3, 4) , \quad v_c = reshape(V_{*3}, 3, 4) , \quad v_d = reshape(V_{*4}, 3, 4) ,    (4.121)

in a linear combination. Up to four scale factors, \beta_a, \beta_b, \beta_c, and \beta_d, are used within the
vector combinations

^{C}\tilde{P}^{*}_{s} = \beta_a v_a + \beta_b v_b
^{C}\tilde{P}^{*}_{s} = \beta_a v_a + \beta_b v_b + \beta_c v_c    (4.122)
^{C}\tilde{P}^{*}_{s} = \beta_a v_a + \beta_b v_b + \beta_c v_c + \beta_d v_d .

The betas are calculated with non-linear shape constraints that could not be imposed during
the eigenvalue decomposition. A linear combination of vectors within the null space produces
another vector which also lies in the null space, i.e. 0 = M(\beta_a v_a) = M(\beta_b v_b) = M(\beta_c v_c) =
M(\beta_d v_d) = M(\beta_a v_a + \beta_b v_b + \beta_c v_c + \beta_d v_d). An estimated set of model points in the camera
frame can be computed for each combination,

^{C}P^{*}_{m} = ^{C}\tilde{P}^{*}_{s}\, \alpha ,    (4.123)
and then the absolute orientation method is applied for the relative transformation.
4.3.3.1 Distance Constraint
The polynomial SDP relies on (quadratic) constraints that are rotation and translation
invariant. Euclidean distance between vertex pairs is a logical choice, and a set of four
vertices has 4!/((4-2)!\,2!) = 6 possible combinations. In a noiseless system, distances in the model
and camera frames should be equal,

\left\| ^{C}P^{i}_{s} - ^{C}P^{j}_{s} \right\|^2 = \left\| ^{M}P^{i}_{s} - ^{M}P^{j}_{s} \right\|^2 .    (4.124)

However, with noise present, an algorithm can be used to minimize any error by comparing
distances in both frames for all l = 1 \dots 6 vertex combinations. The two variable/vector
case is provided as an example.

^{C}d_l = \left\| (\beta_a v^{i}_{a} + \beta_b v^{i}_{b}) - (\beta_a v^{j}_{a} + \beta_b v^{j}_{b}) \right\|^2
      = \left( \beta_a (X^{i}_{a} - X^{j}_{a}) + \beta_b (X^{i}_{b} - X^{j}_{b}) \right)^2 + \left( \beta_a (Y^{i}_{a} - Y^{j}_{a}) + \beta_b (Y^{i}_{b} - Y^{j}_{b}) \right)^2 + \left( \beta_a (Z^{i}_{a} - Z^{j}_{a}) + \beta_b (Z^{i}_{b} - Z^{j}_{b}) \right)^2    (4.125)

^{M}d_l = \left\| ^{M}P^{i}_{s} - ^{M}P^{j}_{s} \right\|^2    (4.126)

4.3.3.2 “Linearization” + Gauss-Newton
In Moreno-Noguer et al [27], a two-step procedure is suggested for finding the correct scale
factors, betas, of the eigenvector combination. In the first step, the six distance constraints
are used in a linear system where all combinations of betas are decoupled and treated as
independent variables.

C \beta = ^{M}d    (4.127)

\beta = (C^T C)^{-1} C^T\, ^{M}d    (4.128)

This is problematic and becomes ill-conditioned for more than three variables. The two
variable/vector case is provided here as an example.

^{C}d_l = C^{l}_{20} \beta_a^2 + C^{l}_{02} \beta_b^2 + C^{l}_{11} \beta_a \beta_b    (4.129)

C = \begin{bmatrix} C^{1}_{20} & C^{1}_{02} & C^{1}_{11} \\ \vdots & \vdots & \vdots \\ C^{6}_{20} & C^{6}_{02} & C^{6}_{11} \end{bmatrix} , \quad \beta = \begin{bmatrix} \beta_a^2 \\ \beta_b^2 \\ \beta_a \beta_b \end{bmatrix} , \quad ^{M}d = \begin{bmatrix} ^{M}d_1 \\ \vdots \\ ^{M}d_6 \end{bmatrix}    (4.130)
The second step is a Gauss-Newton algorithm that uses the results of the first step
as an initial guess. Because of the way the linearization operates, it will often produce an
initial guess that is very far from the global minimum. Hence, the Gauss-Newton algorithm is
likely to diverge.

f_l(\beta) = \left( ^{C}d_l - ^{M}d_l \right)^2    (4.131)

\beta^{k+1} = \beta^{k} - \left( J_F(\beta^k)^T J_F(\beta^k) \right)^{-1} J_F(\beta^k)^T F(\beta^k)    (4.132)

Again, the two variable/vector case is provided as an example.

f_l(\beta_a, \beta_b) = \left( C^{l}_{20} \beta_a^2 + C^{l}_{02} \beta_b^2 + C^{l}_{11} \beta_a \beta_b - ^{M}d_l \right)^2    (4.133)

\beta = \begin{bmatrix} \beta_a \\ \beta_b \end{bmatrix} , \quad F(\beta) = \begin{bmatrix} f_1(\beta_a, \beta_b) \\ \vdots \\ f_6(\beta_a, \beta_b) \end{bmatrix} , \quad J_F(\beta) = \begin{bmatrix} \partial f_1(\beta_a,\beta_b)/\partial\beta_a & \partial f_1(\beta_a,\beta_b)/\partial\beta_b \\ \vdots & \vdots \\ \partial f_6(\beta_a,\beta_b)/\partial\beta_a & \partial f_6(\beta_a,\beta_b)/\partial\beta_b \end{bmatrix}    (4.134)

4.3.3.3 Dot Product Constraint
The dot product is another rotation and translation invariant quadratic constraint. Unlike
distance, the dot product constraint requires a vertex triplet, and a set of four vertices has
4!/((4-3)!\,2!) = 12 possible combinations. In a noiseless system, the direction cosine that the dot
product represents should be equal in the camera and model frames,

\left( ^{C}P^{i}_{s} - ^{C}P^{k}_{s} \right) \cdot \left( ^{C}P^{j}_{s} - ^{C}P^{k}_{s} \right) = \left( ^{M}P^{i}_{s} - ^{M}P^{k}_{s} \right) \cdot \left( ^{M}P^{j}_{s} - ^{M}P^{k}_{s} \right) .    (4.135)

Dot products are computed for all l = 1 \dots 12 sets. The two variable case is provided as an
example.

^{C}d_l = \left( (\beta_a v^{i}_{a} + \beta_b v^{i}_{b}) - (\beta_a v^{k}_{a} + \beta_b v^{k}_{b}) \right) \cdot \left( (\beta_a v^{j}_{a} + \beta_b v^{j}_{b}) - (\beta_a v^{k}_{a} + \beta_b v^{k}_{b}) \right)
      = \left( \beta_a (X^{i}_{a} - X^{k}_{a}) + \beta_b (X^{i}_{b} - X^{k}_{b}) \right) \left( \beta_a (X^{j}_{a} - X^{k}_{a}) + \beta_b (X^{j}_{b} - X^{k}_{b}) \right)
      + \left( \beta_a (Y^{i}_{a} - Y^{k}_{a}) + \beta_b (Y^{i}_{b} - Y^{k}_{b}) \right) \left( \beta_a (Y^{j}_{a} - Y^{k}_{a}) + \beta_b (Y^{j}_{b} - Y^{k}_{b}) \right)
      + \left( \beta_a (Z^{i}_{a} - Z^{k}_{a}) + \beta_b (Z^{i}_{b} - Z^{k}_{b}) \right) \left( \beta_a (Z^{j}_{a} - Z^{k}_{a}) + \beta_b (Z^{j}_{b} - Z^{k}_{b}) \right)    (4.136)

^{M}d_l = \left( ^{M}P^{i}_{s} - ^{M}P^{k}_{s} \right) \cdot \left( ^{M}P^{j}_{s} - ^{M}P^{k}_{s} \right)    (4.137)

4.3.3.4 Orthogonal Triad Constraints
If the simplex is an orthogonal triad (previous section), a specific set of constraints may be
used to define its shape. The set of six constraints includes three dot products,

^{C}d_1 = \left( ^{C}P^{2}_{s} - ^{C}P^{1}_{s} \right) \cdot \left( ^{C}P^{3}_{s} - ^{C}P^{1}_{s} \right)
^{C}d_2 = \left( ^{C}P^{2}_{s} - ^{C}P^{1}_{s} \right) \cdot \left( ^{C}P^{4}_{s} - ^{C}P^{1}_{s} \right)    (4.138)
^{C}d_3 = \left( ^{C}P^{3}_{s} - ^{C}P^{1}_{s} \right) \cdot \left( ^{C}P^{4}_{s} - ^{C}P^{1}_{s} \right)

^{M}d_1 = ^{M}d_2 = ^{M}d_3 = 0 ,    (4.139)

and three squared distances,

^{C}d_4 = \left\| ^{C}P^{2}_{s} - ^{C}P^{1}_{s} \right\|^2
^{C}d_5 = \left\| ^{C}P^{3}_{s} - ^{C}P^{1}_{s} \right\|^2    (4.140)
^{C}d_6 = \left\| ^{C}P^{4}_{s} - ^{C}P^{1}_{s} \right\|^2

^{M}d_4 = ^{M}d_5 = ^{M}d_6 = 1 .    (4.141)

4.3.3.5 Polynomial Semidefinite Program
An important observation is made regarding the quadratic constraints with respect to the
betas. Specifically, the sum of squared distance or dot product error in the camera and
model frames is in fact a multivariate polynomial in the betas.
E = \sum_{l=1}^{6} \left( ^{C}d_l(\beta) - ^{M}d_l \right)^2    (4.142)
Lasserre [23] shows how minimizing a polynomial that is a sum-of-squares can be transformed into a convex linear matrix inequality (LMI) problem and then efficiently solved in
a semidefinite program (SDP). Hence, the name that was given to the algorithm, “Simplex
SDP”.
A polynomial is a finite linear combination of monomials with real-valued coefficients.
If all of the monomials are of the same degree, the polynomial is called homogeneous. A
multivariate polynomial is a sum-of-squares if it can be written as a sum of squares of other
polynomials. In general, sum-of-squares decompositions are not unique. A deceptively simple
sufficient condition for a real-valued polynomial to be globally nonnegative (convex) is the
existence of a sum of squares decomposition.
SDP is a broad generalization of linear programming (LP), to the case of symmetric
matrices. A linear objective function is optimized over linear equality constraints with a
semidefinite matrix condition. The LMI minimization problem is written in standard form
as
\min_{X} E = P^T X , \quad \text{such that} \ M \geq 0    (4.143)

where E is the objective function and M is the semidefinite matrix of the linear inequality
constraint. The key feature of SDPs is their convexity, since the feasible set defined by the
constraints is convex. In this context, the SDP is used to solve convex LMI relaxations of
multivariate real-valued polynomials. Specifically, the quadratic shape constraints on the
eigenvector combinations are decomposed into monomials of the betas when written as a
sum of squares. The Gauss-Newton algorithm of Moreno-Noguer et al [27] does not have
guaranteed convergence, but the SDP framework presented here does.
As shown earlier, the distance and dot product constraints in the camera may be
written as a polynomial in the betas with coefficients C. The two variable case is provided
as an example.

E_l = \left( C^{l}_{20}\beta_a^2 + C^{l}_{02}\beta_b^2 + C^{l}_{11}\beta_a\beta_b - ^{M}d_l \right)^2
    = P^{l}_{20}\beta_a^2 + P^{l}_{31}\beta_a^3\beta_b + P^{l}_{40}\beta_a^4 + P^{l}_{02}\beta_b^2 + P^{l}_{13}\beta_a\beta_b^3 + P^{l}_{04}\beta_b^4 + P^{l}_{11}\beta_a\beta_b + P^{l}_{22}\beta_a^2\beta_b^2    (4.144)

When the error is squared, the coefficients are transformed into a higher dimensional space.
Effectively, the quadratic becomes a quartic. The following expression outlines the mapping
to squared error, C \rightarrow P.

P^{l}_{20} = -2\,C^{l}_{20}\,^{M}d_l \qquad P^{l}_{31} = 2\,C^{l}_{11}C^{l}_{20} \qquad P^{l}_{40} = (C^{l}_{20})^2 \qquad P^{l}_{02} = -2\,C^{l}_{02}\,^{M}d_l
P^{l}_{13} = 2\,C^{l}_{02}C^{l}_{11} \qquad P^{l}_{04} = (C^{l}_{02})^2 \qquad P^{l}_{11} = -2\,C^{l}_{11}\,^{M}d_l \qquad P^{l}_{22} = (C^{l}_{11})^2 + 2\,C^{l}_{02}C^{l}_{20}    (4.145)

Now the objective function may be written in linear form, E = P^T X, where

X = \begin{bmatrix} \beta_a^2 & \beta_a^3\beta_b & \beta_a^4 & \beta_b^2 & \beta_a\beta_b^3 & \beta_b^4 & \beta_a\beta_b & \beta_a^2\beta_b^2 \end{bmatrix}^T    (4.146)

P = \begin{bmatrix} P_{20} & P_{31} & P_{40} & P_{02} & P_{13} & P_{04} & P_{11} & P_{22} \end{bmatrix}^T .    (4.147)
In this algorithm, the semidefinite matrix M is actually constructed from three smaller
matrices that are semidefinite themselves,

M = \begin{bmatrix} M_0 & 0 & 0 \\ 0 & M_1 & 0 \\ 0 & 0 & M_2 \end{bmatrix} .    (4.148)

M_0 is a vector weighting constraint used to force the magnitude of the first eigenvector, \beta_a,
to be larger than all others. M_1 and M_2 are both “moment” matrices that are constructed
from

M_1 = X_1 X_1^T    (4.149)

M_2 = X_2 X_2^T .    (4.150)

The LMI constraints of the two variable case are provided as an example.

X_1 = \begin{bmatrix} \beta_a & \beta_b \end{bmatrix}^T , \quad X_2 = \begin{bmatrix} 1 & \beta_a^2 & \beta_b^2 & \beta_a\beta_b \end{bmatrix}^T , \quad M_0 = \begin{bmatrix} \beta_a^2 - \beta_b^2 \end{bmatrix} , \quad M_1 = \begin{bmatrix} \beta_a^2 & \beta_a\beta_b \\ \beta_a\beta_b & \beta_b^2 \end{bmatrix}    (4.151)

M_2 = \begin{bmatrix} 1 & \beta_a^2 & \beta_b^2 & \beta_a\beta_b \\ \beta_a^2 & \beta_a^4 & \beta_a^2\beta_b^2 & \beta_a^3\beta_b \\ \beta_b^2 & \beta_a^2\beta_b^2 & \beta_b^4 & \beta_a\beta_b^3 \\ \beta_a\beta_b & \beta_a^3\beta_b & \beta_a\beta_b^3 & \beta_a^2\beta_b^2 \end{bmatrix}    (4.152)

After the SDP terminates, all of the variables within vector X have been estimated.
The beta variables in the original problem formulation can easily be recovered. Because the
sum-of-squares decomposition is not unique, a (trivial) sign ambiguity exists that places the
simplex in front of or behind the camera. The two variable case is shown as an example below.

\beta_a = \pm\sqrt{\beta_a^2} = \pm\sqrt{X_1} , \qquad \beta_b = \frac{\beta_a\beta_b}{\beta_a} = \frac{X_7}{\beta_a}    (4.153)

During the discussion of the algorithm, only examples of the two variable case were
provided. However, it is more desirable to use additional eigenvectors within the SDP.
Obviously, there is a performance/complexity trade-off, and it is not recommended to use
more than four eigenvectors. The appendix contains additional information that specifically
outlines the three and four variable cases. Note that we found the best results are produced
with four eigenvectors but presently have not been able to prove why this is the case.
Chapter 5
Recursive Estimation
The previous chapters describe the two measurement sources (or inputs) to the pose estimation system; specifically, how raw data is transformed into a usable measurement. Regardless
of its source, each measurement may be thought of as a noisy observation of the same object
at some point in time. A method is needed to fuse the measurements together and account
for their uncertainty. This chapter covers recursive state estimation for the sole purpose of
pose estimation. [35] is an excellent resource for recursive estimation techniques.
Figure 5.1: Markov Process
5.1 Recursive Estimation of Markov Processes
A Markov process, figure 5.1, is a stochastic process in which the probability distribution of
the current state is conditionally independent of the path of past states. A Markov model is
a statistical model in which the system being modeled is assumed to be a Markov process.
In a regular Markov model, the state is directly observable. In a hidden Markov model, the
state is not directly observable, but variables influenced by the state are observable. Because
of the Markov assumption, the true state is conditionally independent of all earlier states
given the immediately previous state.
p(xk |x0 , . . . , xk−1 ) = p(xk |xk−1 )
(5.1)
Similarly the measurement at the kth time step is dependent only upon the current state
and is conditionally independent of all other states given the current state.
p(zk |x0 , . . . , xk ) = p(zk |xk )
(5.2)
Using these assumptions the probability distribution over all states and measurements of the
HMM can be written simply as:
$$p(x_0, \dots, x_k, z_1, \dots, z_k) = p(x_0) \prod_{i=1}^{k} p(z_i | x_i)\, p(x_i | x_{i-1}) \qquad (5.3)$$
Recursive Bayesian estimation is a general probabilistic approach for estimating an
unknown probability density function recursively over time using incoming measurements
and a mathematical process model.
Because all uncertainty in our system is assumed to be Gaussian, we are concerned with a family of recursive estimators collectively called Gaussian filters. Historically, Gaussian filters constitute the earliest tractable implementations of the Bayes filter for continuous spaces. They are also by far the most popular family of techniques to date, despite a number of shortcomings.
Gaussian techniques all share the basic idea that probability distributions are represented by multivariate normal distributions. In the moments parameterization, the density
is parameterized by a mean and covariance, the first and second moment respectively; all
higher moments are completely specified by the first two moments.
5.2 Gaussian & Nonparametric Filters
The most studied technique for implementing Bayes filters is the Kalman Filter which was
introduced by Swerling (1958) and Kalman (1960). It was invented as a technique for filtering
and prediction of linear Gaussian systems by implementing belief computation for continuous
states. Beliefs (conditional probability) are represented by the moments parameterization.
Posteriors are Gaussian if the following three properties apply, in addition to the Markov
assumptions of the Bayes filter.
The state update conditional probability, $p(x_k | u_k, x_{k-1})$, must be Gaussian. One way of specifying this is through the linear discrete-time dynamic equation
$$x_k = F_k x_{k-1} + G_k u_k + w_k, \qquad (5.4)$$
where
$$w_k \sim N(0, Q_k) . \qquad (5.5)$$
The measurement conditional probability, $p(z_k | x_k)$, must also be Gaussian. This can be specified by the linear algebraic equation
$$z_k = H_k x_k + v_k, \qquad (5.6)$$
where
$$v_k \sim N(0, R_k) . \qquad (5.7)$$
The initial belief must also be normally distributed,
$$p(x_0) = N(0, \Sigma_0) . \qquad (5.8)$$
If the properties above are true, the Kalman filter calculates
$$\Sigma_{k|k-1} = F_k \Sigma_{k-1} F_k^T + Q_k \qquad (5.9)$$
as a prediction of state uncertainty, where $Q$ is the process noise covariance. The state and uncertainty predictions are referred to together as the "time update" equations.
The state and uncertainty are corrected when a measurement is received. First, the Kalman gain or blending factor,
$$K_k = \Sigma_{k|k-1} H_k^T (H_k \Sigma_{k|k-1} H_k^T + R_k)^{-1}, \qquad (5.10)$$
is calculated, which minimizes the a posteriori error covariance. Next, the measurement is incorporated to generate an a posteriori state estimate via
$$x_{k|k} = x_{k|k-1} + K_k (z_k - H_k x_{k|k-1}) . \qquad (5.11)$$
Finally, an a posteriori error covariance is obtained with the Kalman gain,
$$\Sigma_{k|k} = (I - K_k H_k) \Sigma_{k|k-1} . \qquad (5.12)$$
These three equations are collectively referred to as the "measurement update" equations.
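To make the update cycle concrete, the following sketch packs equations 5.9 through 5.12 into a single function. It is a generic linear Kalman filter step, written here against the Eigen linear algebra library purely for illustration (Eigen is an assumption of this sketch, not the library used by the testbed).

```cpp
#include <Eigen/Dense>
using Eigen::MatrixXd;
using Eigen::VectorXd;

// One linear Kalman filter cycle: time update (5.9) followed by the
// measurement update equations (5.10)-(5.12). x and Sigma hold the previous
// posterior on entry and the new posterior on return.
void kalmanStep(VectorXd& x, MatrixXd& Sigma,
                const MatrixXd& F, const MatrixXd& G, const VectorXd& u,
                const MatrixXd& Q, const MatrixXd& H, const MatrixXd& R,
                const VectorXd& z)
{
    // Time update ("prediction").
    x = F * x + G * u;
    Sigma = F * Sigma * F.transpose() + Q;

    // Measurement update ("correction").
    MatrixXd S = H * Sigma * H.transpose() + R;         // innovation covariance
    MatrixXd K = Sigma * H.transpose() * S.inverse();   // Kalman gain (5.10)
    x = x + K * (z - H * x);                            // state correction (5.11)
    const long n = x.size();
    Sigma = (MatrixXd::Identity(n, n) - K * H) * Sigma; // covariance correction (5.12)
}
```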
Unfortunately, state transitions and measurements are rarely linear in practice. The
Extended Kalman Filter (EKF) relaxes the linearity assumption at the expense of accuracy.
Instead, state transition and measurement probabilities are governed by (differentiable) nonlinear functions,
$$x_k = f(x_{k-1}, u_k) + w_k, \qquad z_k = h(x_k) + v_k . \qquad (5.13)$$
This violates assumptions made during the derivation of the Kalman filter. A Gaussian
probability passed through a linear function remains Gaussian but is not necessarily Gaussian
if passed through a nonlinear function. The EKF inherits from the Kalman Filter the basic
belief representation, but it differs in that this belief is only approximate. If the initial state
estimate is wrong or if the process is modeled incorrectly, the filter may quickly diverge. The
central idea of the EKF is linearization, approximation of the nonlinear functions about the
current state. However, the estimated covariance matrix tends to underestimate the true
covariance. After linearization the belief propagation is equivalent to the Kalman Filter.
The uncertainty prediction equation of the EKF is the same as in the Kalman Filter except that the matrix $F_k$ is the Jacobian of the nonlinear update function, i.e.
$$\Sigma_{k|k-1} = F_k \Sigma_{k-1} F_k^T + Q_k . \qquad (5.14)$$
Similarly, the Kalman gain is computed with the Jacobian $H_k$ of the nonlinear measurement function via
$$K_k = \Sigma_{k|k-1} H_k^T (H_k \Sigma_{k|k-1} H_k^T + R_k)^{-1} . \qquad (5.15)$$
As expected, the a posteriori state estimate is calculated with the nonlinear function,
$$x_{k|k} = x_{k|k-1} + K_k \left( z_k - h(x_{k|k-1}) \right) . \qquad (5.16)$$
Finally, an a posteriori error covariance is obtained with the Jacobian and Kalman gain,
$$\Sigma_{k|k} = (I - K_k H_k) \Sigma_{k|k-1} . \qquad (5.17)$$
The EKF uses a first-order Taylor series to approximate the nonlinear functions. Matrices of partial derivatives, Jacobians, are computed at each iteration. The goodness of the linear approximation applied by the EKF depends on two main factors: the degree of uncertainty and the degree of local nonlinearity of the functions being approximated. In practice, it is important to keep the uncertainty small; even so, the EKF can give reasonable performance.
The Taylor series expansion applied by the EKF is only one way to linearize the
transformation of a Gaussian. The Unscented Kalman Filter (UKF) uses a different approach
and has been found to often yield superior results, [22]. It performs a stochastic linearization
through the use of a weighted statistical linear regression process. The UKF deterministically
extracts so-called sigma points from the Gaussian and passes them through the nonlinear
functions. These points are located at the mean and along the main axes of covariance, two
per dimension. For an n-dimensional Gaussian, 2n+1 sigma points are chosen.
Recall that in the EKF only the mean is considered. After the sigma points are passed through the nonlinear function, the mean and covariance are recovered by a weighted sum. For nonlinear systems the UKF produces results equal to or better than the EKF, with large improvements occurring at points of high nonlinearity and uncertainty. The filter more accurately captures the true mean and covariance. Another advantage of the UKF is that it does not require the computation of Jacobians, which are difficult to determine in some domains. The UKF is thus often referred to as a derivative-free filter.
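As a rough illustration of the sigma-point idea, the sketch below generates the 2n+1 points for an n-dimensional Gaussian using a Cholesky square root. Eigen is assumed for the factorization, and the single spread parameter lambda stands in for whatever scaling a particular UKF variant uses; the associated weights are omitted.

```cpp
#include <Eigen/Dense>
#include <vector>
using Eigen::MatrixXd;
using Eigen::VectorXd;

// Deterministic sigma points for an n-dimensional Gaussian (mean mu,
// covariance Sigma): the mean plus 2n points along the principal axes of
// the covariance, scaled by the spread parameter lambda.
std::vector<VectorXd> sigmaPoints(const VectorXd& mu, const MatrixXd& Sigma,
                                  double lambda)
{
    const long n = mu.size();
    // Matrix square root of (n + lambda) * Sigma via Cholesky factorization.
    MatrixXd L = ((static_cast<double>(n) + lambda) * Sigma).llt().matrixL();

    std::vector<VectorXd> pts;
    pts.push_back(mu);                   // X_0, the mean itself
    for (long i = 0; i < n; ++i) {
        pts.push_back(mu + L.col(i));    // X_i
        pts.push_back(mu - L.col(i));    // X_{n+i}
    }
    return pts;                          // 2n + 1 points in total
}
```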
The particle filter, also known as the Sequential Monte Carlo (SMC) method, is an alternative nonparametric implementation of the Bayes filter. Particle filters were first introduced in the early 1950s by physicists and have recently become very popular in statistics and related fields. Nonparametric filters do not rely on a fixed functional form of the posterior, such as
Gaussians. The state space is approximated by a set of random samples drawn from the
posterior distribution. The quality of the approximation depends on the number of samples
used to represent the posterior. As the number of samples goes to infinity, nonparametric
techniques tend to converge uniformly to the correct posterior, under weak assumptions.
Such a representation is approximate, but it is nonparametric, and therefore can represent
a much broader space of distributions. Another advantage of sample-based representation
is the ability to model nonlinear transformations of random variables. The representational
power of these techniques comes at the price of added computational complexity.
Each particle is an instantiation of the current state. The denser a subregion of state space is populated by samples, the more likely it is that the true state falls into this region.
Like the UKF, particles are assigned weights, but the samples are not recombined into an
estimate of mean and covariance. Instead, some of the particles are kept and others are
replaced by resampling. The resampling step has the important function of forcing particles
back into regions of the state space with high posterior probability.
5.3 Selection of Quaternion EKF
When the pose estimation literature is surveyed, quaternions are found to be a favorite
parameterization choice in every domain, from aerospace to AR, because of their nonsingular
mapping. Consequently, the rigid body motion model using quaternions (chapter two) is
adopted. Understanding system dynamics is an initial step in the design of a recursive
estimator.
From the previous section, we are faced with three different estimators, EKF, UKF and
particle filter. It will be shown below that the state transition and measurement functions
are not highly nonlinear, e.g. no trigonometric functions. Therefore, the Taylor series
linearization method of the EKF should perform fairly well, and the added complexity of
the UKF is unwarranted. This hypothesis is supported by the work of [21] that showed no
significant benefit of the UKF over the EKF for quaternions.
Also, it is assumed that all probability distributions are Gaussian. The assumption is
approximately true for the sensor noise which can be verified through inspection of actual
measurement data. The process noise has yet to be carefully studied and might happen to be
non-Gaussian. The work of [1] suggests that a particle filter works better than an EKF for head motion within AR. However, the slightly improved estimation accuracy did not seem to outweigh the additional complexity.
Additionally, we have a method of computing camera uncertainty that cannot be
easily integrated into the particle filter. The uncertainty calculation is covered in the last
section of this chapter. For the reasons stated above, a quaternion-based EKF was chosen
for use within our augmented reality system.
5.4 Time Update
Figure 5.2: EKF Prediction and Correction Cycle
The time update is responsible for projecting forward, or predicting, in time the current
state and uncertainty estimates to obtain a priori estimates at the next time step. Recall
from the Markov assumption that only the previous state is required to predict the current
state.
5.4.1 State Vector
The state vector contains seven rotational terms and nine translational terms and is therefore of length sixteen. The rotational terms include the quaternion, $^{M}_{C}q$, and the angular velocity, $^{C}\omega_{cam}$. The translational terms include the position, $^{M}p_{cam}$, the velocity, $^{M}v_{cam}$, and the acceleration, $^{M}a_{cam}$:
$$x = \begin{bmatrix} ^{M}_{C}q & ^{C}\omega_{cam} & ^{M}p_{cam} & ^{M}v_{cam} & ^{M}a_{cam} \end{bmatrix}^{T} = \begin{bmatrix} q_0 & q_x & q_y & q_z & \omega_x & \omega_y & \omega_z & p_x & p_y & p_z & v_x & v_y & v_z & a_x & a_y & a_z \end{bmatrix}^{T} \qquad (5.18)$$
Note that the quaternion and position elements together form the transformation of the camera {C} with respect to the model {M}, $H(^{M}_{C}q,\, ^{M}p_{cam}) = {}^{M}_{C}H$. In contrast, and maybe a subtlety, the computer vision algorithms and perspective projection model of chapter four rely on the inverse transformation, $^{C}_{M}H$.
5.4.2 State Prediction
The state transition function is linear for all states except the quaternion. The same is true
for all rotation parameterizations, but with quaternions singularities are avoided.
$$\begin{bmatrix} q_{k|k-1} \\ \omega_{k|k-1} \\ p_{k|k-1} \\ v_{k|k-1} \\ a_{k|k-1} \end{bmatrix} = \begin{bmatrix} g(x_{k-1}, \Delta t) \\ \omega_{k-1} \\ p_{k-1} + \Delta t\, v_{k-1} + \frac{\Delta t^2}{2}\, a_{k-1} \\ v_{k-1} + \Delta t\, a_{k-1} \\ a_{k-1} \end{bmatrix} + \begin{bmatrix} w_q^k \\ w_\omega^k \\ w_p^k \\ w_v^k \\ w_a^k \end{bmatrix} \qquad (5.19)$$
This expression may be written more compactly as a single nonlinear function,
$$x_{k|k-1} = f(x_{k-1}, \Delta t) + w_{k-1} . \qquad (5.20)$$
Because the time step, $\Delta t$, and angular velocity, $\omega$, are both assumed to be small, the simple quaternion update function of equation 2.70 is used, i.e.
$$g(x_{k-1}, \Delta t) = q_{k-1} + \frac{\Delta t}{2}\, Q_R(\Omega_{k-1})\, q_{k-1} . \qquad (5.21)$$
This assumption makes the function less nonlinear by avoiding the trigonometric functions
of equation 2.71 but might compromise accuracy if the assumption is violated. Recall that
the EKF can perform very poorly in areas of high nonlinearity.
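A minimal sketch of the small-angle prediction of equation 5.21 is shown below. The 4x4 matrix written out here assumes a [q0 qx qy qz] ordering and one common sign convention for the angular-rate matrix; the actual convention must match equation 2.70, which is not reproduced in this section. Eigen is used only for the matrix arithmetic.

```cpp
#include <Eigen/Dense>
using Eigen::Matrix4d;
using Eigen::Vector3d;
using Eigen::Vector4d;

// Small-angle quaternion prediction, q_k = q_{k-1} + (dt/2) * Omega(w) * q_{k-1}.
// The quaternion is stored as [q0 qx qy qz]; w is the body-frame angular rate.
Vector4d predictQuaternion(const Vector4d& q, const Vector3d& w, double dt)
{
    Matrix4d Omega;
    Omega <<    0.0, -w.x(), -w.y(), -w.z(),
              w.x(),    0.0,  w.z(), -w.y(),
              w.y(), -w.z(),    0.0,  w.x(),
              w.z(),  w.y(), -w.x(),    0.0;

    Vector4d qNext = q + 0.5 * dt * Omega * q;
    return qNext.normalized();  // brute-force re-normalization (see section 5.5.4)
}
```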
5.4.3 Uncertainty Prediction
The Jacobian, $F_k$, of the state transition function may be quickly computed analytically. At each time step, only the elements associated with the quaternion will change, a 4x7 sub-block, i.e.
$$F_k = \frac{\partial f(x_{k-1}, \Delta t)}{\partial x_{k-1}} = \begin{bmatrix} \frac{\partial g(x_{k-1},\Delta t)}{\partial q_{k-1}} & \frac{\partial g(x_{k-1},\Delta t)}{\partial \omega_{k-1}} & 0 & 0 & 0 \\ 0 & I_3 & 0 & 0 & 0 \\ 0 & 0 & I_3 & I_3\,\Delta t & I_3\,\frac{\Delta t^2}{2} \\ 0 & 0 & 0 & I_3 & I_3\,\Delta t \\ 0 & 0 & 0 & 0 & I_3 \end{bmatrix} . \qquad (5.22)$$
It is assumed that the variance of the process disturbance model increases as the sample time increases. In other words, inaccuracies of the process model itself are exaggerated as the amount of time increases. For this reason, a linear time-varying uncertainty matrix is used,
$$Q_k = \Delta t \begin{bmatrix} \sigma_{q_s}^2 & 0 & 0 & 0 & 0 \\ 0 & \sigma_{\omega_s}^2 & 0 & 0 & 0 \\ 0 & 0 & \sigma_{p_s}^2 & 0 & 0 \\ 0 & 0 & 0 & \sigma_{v_s}^2 & 0 \\ 0 & 0 & 0 & 0 & \sigma_{a_s}^2 \end{bmatrix}, \qquad (5.23)$$
where the $\sigma_{x_s}^2$ terms represent units of variance per second.

5.5 Measurement Update
In the measurement update step the state conditional probability distribution is updated
with sensor data. In our system, two of the measurement functions are linear and one is
nonlinear. The degree of nonlinearity is low and arises from vector rotation with quaternions.
For any rotation parameterization a similar nonlinear function would be required.
5.5.1 Camera Update
Of the camera, gyros, and accelerometers, the camera is the best measurement source for the pose estimation system. It provides absolute position and orientation data,
$$z_{cam} = \begin{bmatrix} \pm q_k \\ p_k \end{bmatrix} + v_k, \qquad (5.24)$$
whereas the IMU only provides derivatives of these states. Fortunately, the camera's measurement function is linear,
$$H_{cam}\, x_{k|k-1} = \begin{bmatrix} q_{k|k-1} \\ p_{k|k-1} \end{bmatrix}, \qquad (5.25)$$
where
$$H_{cam} = \begin{bmatrix} I_4 & 0 & 0 & 0 & 0 \\ 0 & 0 & I_3 & 0 & 0 \end{bmatrix}, \qquad (5.26)$$
and therefore the filter will produce a (nearly) optimal estimate of state and uncertainty. Measurements are received at frame rate, typically 25-30 Hz, assuming the fiducials are within the field of view.
Before using a camera measurement, it is necessary to perform a simple check for the "mirrored" quaternion, $-q_k$. Recall that a rotation is represented by exactly two quaternions, i.e. two opposite points or poles on the hypersphere. The Kalman Filter will effectively interpolate between the predicted and measured quaternion, so it is of great importance to use the correct sign. Specifically, if $\|\hat{q}_{k|k-1} - q_k\| > \|\hat{q}_{k|k-1} + q_k\|$, then the quaternion of the measurement should be negated.
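In code, the check is a one-line comparison; the sketch below assumes the quaternion is stored as a 4-vector and uses Eigen only for the norm.

```cpp
#include <Eigen/Dense>
using Eigen::Vector4d;

// Resolve the +/-q ambiguity of a measured rotation quaternion before the
// EKF update: keep the sign closest to the prediction so the filter
// interpolates along the short arc.
Vector4d resolveQuaternionSign(const Vector4d& qPredicted, const Vector4d& qMeasured)
{
    if ((qPredicted - qMeasured).norm() > (qPredicted + qMeasured).norm())
        return -qMeasured;
    return qMeasured;
}
```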
5.5.2 Gyro Update
The gyro noise level is relatively low and the sample rate, ∼100 Hz, is much faster than the camera's. Thus, the sensor helps smooth orientation between camera updates. Because angular velocity is one of the states,
$$z_{gyro} = \omega_k + v_k, \qquad (5.27)$$
the gyro measurement function is also linear,
$$H_{gyro}\, x_{k|k-1} = \omega_{k|k-1}, \qquad (5.28)$$
where
$$H_{gyro} = \begin{bmatrix} 0 & I_3 & 0 & 0 & 0 \end{bmatrix} . \qquad (5.29)$$
5.5.3 Accelerometer Update
In our system, the IMU returns angular velocity and acceleration at the same rate and within
the same measurement vector. The measurements are presented separately for a couple of
reasons. First, in general, a system might have only gyros or accelerometers; or, a system
might have both sensors which operate asynchronously. Secondly, in our implementation the
accelerometer is not always used (discussed below).
An acceleration measurement is a combination of gravity and acceleration of the IMU itself,
$$^{C}z_{acl} = {}^{C}a_{cam} + {}^{C}a_{grav} + v_k, \qquad (5.30)$$
which has been rotated into the camera frame with the rotation matrix discussed in chapter three. A vector expressed in the camera or model frame can be rotated using the predicted orientation and quaternion multiplication via
$$\begin{bmatrix} 0 \\ {}^{M}z_{acl} \end{bmatrix} = q_{k|k-1} \otimes \begin{bmatrix} 0 \\ {}^{C}z_{acl} \end{bmatrix} \otimes q_{k|k-1}^{-1} \qquad (5.31)$$
and
$$\begin{bmatrix} 0 \\ {}^{C}z_{acl} \end{bmatrix} = q_{k|k-1}^{-1} \otimes \begin{bmatrix} 0 \\ {}^{M}z_{acl} \end{bmatrix} \otimes q_{k|k-1} . \qquad (5.32)$$
Alternatively, the same operation can be performed by converting the quaternion back into a rotation matrix and then multiplying the vector, such as
$$^{C}z_{acl} = {}^{C}_{M}R\, {}^{M}z_{acl} = R(q_{k|k-1}^{-1})\, {}^{M}z_{acl} . \qquad (5.33)$$
We make the following assumption for human motion that would likely not apply to most other domains. When the magnitude of acceleration is approximately equal to 9.8 m/s², it is assumed that only gravity is measured and the IMU has no acceleration in the world frame:
$$^{C}z_{acl} = \begin{cases} ^{C}a_{grav}, & (9.8\,\tfrac{m}{s^2} - \sigma_a) \le \|^{C}z_{acl}\| \le (9.8\,\tfrac{m}{s^2} + \sigma_a) \\ ^{C}a_{cam} + {}^{C}a_{grav}, & \text{otherwise} \end{cases} \qquad (5.34)$$
We have found that our accelerometers contain a significant level of noise, which
if used to update position, will be integrated twice. For this reason, the sensor is only
used to update orientation at times when the magnitude is approximately equal to that of
gravity. However, for completeness, the measurement update function for both position and
orientation is presented later. Furthermore, future enhancement of the AR system might
involve replacing the IMU with a less noisy sensor.
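The gating rule of equation 5.34 reduces to a magnitude test; a minimal sketch (Eigen assumed, names illustrative) is shown below. When the test passes, only the orientation update of section 5.5.3.1 is applied; otherwise the accelerometer measurement is skipped in our implementation.

```cpp
#include <Eigen/Dense>
using Eigen::Vector3d;

// Gravity gating of equation 5.34: treat the accelerometer reading as a pure
// gravity measurement only when its magnitude is within sigma_a of 9.8 m/s^2.
bool isGravityOnly(const Vector3d& zAcl, double sigmaA)
{
    const double g = 9.8;            // m/s^2
    const double m = zAcl.norm();
    return (m >= g - sigmaA) && (m <= g + sigmaA);
}
```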
5.5.3.1 Orientation
Both measurement updates rely on the fact that we precisely know gravity in the model.
Gravity vectors in the model and camera frames are related by a simple rotation,
$$^{C}a_{grav} = {}^{C}_{M}R\, {}^{M}a_{grav} . \qquad (5.35)$$
To correct orientation only, gravity is rotated into the camera frame with the predicted quaternion,
$$h_{acl}(x_{k|k-1}, {}^{M}a_{grav}) = R(q_{k|k-1}^{-1})\, {}^{M}a_{grav} . \qquad (5.36)$$
Nonlinearity is introduced by the parameterization of the rotation matrix with the quaternion. The Jacobian may be quickly computed analytically as
$$H_{acl}^{k} = \begin{bmatrix} \frac{\partial h_{acl}(x_{k|k-1}, {}^{M}a_{grav})}{\partial q_{k|k-1}} & 0 & 0 & 0 & 0 \end{bmatrix} . \qquad (5.37)$$

5.5.3.2 Position and Orientation
Composite acceleration vectors in the model and camera frames are also related by a simple rotation,
$$^{C}a_{cam} + {}^{C}a_{grav} = {}^{C}_{M}R \left( {}^{M}a_{cam} + {}^{M}a_{grav} \right) . \qquad (5.38)$$
To correct both position and orientation, acceleration is rotated into the camera frame with the predicted quaternion,
$$h_{acl}(x_{k|k-1}, {}^{M}a_{grav}) = R(q_{k|k-1}^{-1}) \left( a_{k|k-1} + {}^{M}a_{grav} \right) . \qquad (5.39)$$
The Jacobian is almost the same as above but includes an additional term for the acceleration within the state vector,
$$H_{acl}^{k} = \begin{bmatrix} \frac{\partial h_{acl}(x_{k|k-1}, {}^{M}a_{grav})}{\partial q_{k|k-1}} & 0 & 0 & 0 & \frac{\partial h_{acl}(x_{k|k-1}, {}^{M}a_{grav})}{\partial a_{k|k-1}} \end{bmatrix} . \qquad (5.40)$$

5.5.4 Quaternion Constraint
The Kalman Filter, EKF, and UKF do not directly impose any constraints on vector length. Consequently, the quaternion might not be unit length after a measurement update. One option is to simply perform a brute-force normalization after each measurement update,
$$\hat{q}_{k|k} = \frac{q_{k|k}}{\| q_{k|k} \|} . \qquad (5.41)$$
Unfortunately, this does not fix the problem entirely because the uncertainty should also be constrained in a way consistent with the state. Several papers propose techniques that modify the Kalman Filter and add a correction step, [39], [7], [43]. In our case, no ill effects are observed because the sample time and uncertainty are both assumed to be small.
5.6 Camera Uncertainty
All measurement sources into the EKF contain an estimated uncertainty in the form of a
covariance matrix. The gyro and accelerometer covariance matrices remain constant throughout time whereas the camera uncertainty, $\Sigma_{cam}^{k}$, is different at each time step. Noisy image
points cause uncertainty in the final solution of the computer vision algorithm(s). How this
noise propagates through the optimization depends on the pose of the model with respect to
the camera and the configuration of model points. Some poses and configurations are more
sensitive to noise and should therefore be assigned a higher degree of uncertainty. In this
section, methods are developed for calculating pose uncertainty with quaternions in object
space. An unconstrained image space pose uncertainty calculation can be found in Vincent
and Hoff [41].
5.6.1 Derivation
Recall from chapter four that a computer vision algorithm will minimize a residual in either image space or object space. This derivation relies on prior knowledge of which error space the minimization occurred in, and on a model of image point noise, assumed to be Gaussian with zero mean and known variance. For reasons already stated, the elements of the rotation matrix are often parameterized, and the entire 6-DOF pose can be parameterized by a minimum of six parameters. The image space,
$$e_i(\Theta, n) = \left( f_u \frac{r_1(\Theta)\, {}^{M}P_{m_i} + t_x(\Theta)}{r_3(\Theta)\, {}^{M}P_{m_i} + t_z(\Theta)} + u_0 + n_u - u_i \right)^2 + \left( f_v \frac{r_2(\Theta)\, {}^{M}P_{m_i} + t_y(\Theta)}{r_3(\Theta)\, {}^{M}P_{m_i} + t_z(\Theta)} + v_0 + n_v - v_i \right)^2, \qquad (5.42)$$
and object space,
$$e_i(\Theta, n) = \left\| \left( I - F_i(n) \right) \left( R(\Theta)\, {}^{M}P_{m_i} + t(\Theta) \right) \right\|^2, \qquad (5.43)$$
error functions are rewritten to include a noise term and a pose parameterization. Notice how noise propagates through the two functions differently. In our particular case, quaternions are used within the EKF, so the pose parameterization vector,
$$\Theta = \begin{bmatrix} q_0 & q_x & q_y & q_z & t_x & t_y & t_z \end{bmatrix}^{T}, \qquad (5.44)$$
is of length seven, where $H(q, t)^{-1} = {}^{C}_{M}H$. Recall the form of the Taylor series for an arbitrary function of a single variable,
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n = f(a) + \frac{f'(a)(x - a)}{1!} + \frac{f''(a)(x - a)^2}{2!} + \dots \qquad (5.45)$$
The error functions are approximated by a first-order Taylor series in the two variables, pose and noise, via
$$e_i(\Theta, n) \approx e_i(\hat{\Theta}, 0) + \frac{\partial e_i(\hat{\Theta}, 0)}{\partial \Theta}(\Theta - \hat{\Theta}) + \frac{\partial e_i(\hat{\Theta}, 0)}{\partial n}(n - 0) \approx e_i(\hat{\Theta}, 0) + A_i(\Theta - \hat{\Theta}) + B_i n, \qquad (5.46)$$
where $\hat{\Theta}$ is the estimated pose from the computer vision algorithm and $\Theta$ is the true pose which, of course, is not known in practice. It is assumed that the computer vision algorithm converged properly to the global minimum of the error function. Therefore, the first-order Taylor series is an approximation of the function evaluated at the minimum,
$$\min_{\Theta} \sum_{i=1}^{n} \| e_i(\Theta, n) \|^2 \approx \min_{\Theta} \sum_{i=1}^{n} \left\| e_i(\hat{\Theta}, 0) + A_i(\Theta - \hat{\Theta}) + B_i n \right\|^2 . \qquad (5.47)$$
Let $\tilde{\Theta}$ equal the approximate error between the true and estimated pose, $\Theta - \hat{\Theta}$. It can be calculated, in a least-squares sense, using the projection theorem,
$$\tilde{\Theta} = (A^T A)^{-1} A^T \left( E(\hat{\Theta}, 0) + B n \right) = C \left( E(\hat{\Theta}, 0) + B n \right) = C\, E(\hat{\Theta}, 0) + C B n . \qquad (5.48)$$
The large matrices A and B are constructed from the individual Jacobians of each point for the entire set. The A matrix is vertically stacked while the B matrix is block diagonal. Specifically,
$$A = \begin{bmatrix} A_1 \\ \vdots \\ A_n \end{bmatrix}, \qquad B = \begin{bmatrix} B_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & B_n \end{bmatrix} . \qquad (5.49)$$
Because the image noise is assumed to have zero mean, $\bar{n} = 0$, the average error between the true and estimated pose is equal to
$$\bar{\tilde{\Theta}} = C\, E(\hat{\Theta}, 0) . \qquad (5.50)$$
Therefore, the deviation of the error about its mean is approximately
$$\tilde{\Theta} - \bar{\tilde{\Theta}} = \left[ C\, E(\hat{\Theta}, 0) + C B n \right] - C\, E(\hat{\Theta}, 0) = C B n . \qquad (5.51)$$
A least-squares estimate of the pose error covariance can now be derived. Its accuracy is only as good as the first-order linearization:
$$\mathrm{COV}(\tilde{\Theta}) = E\!\left[ \left( \tilde{\Theta} - \bar{\tilde{\Theta}} \right)\left( \tilde{\Theta} - \bar{\tilde{\Theta}} \right)^{T} \right] = E\!\left[ (C B n)(C B n)^T \right] = C B\, E[n n^T]\, B^T C^T = C B\, \mathrm{COV}(n)\, B^T C^T . \qquad (5.52)$$

5.6.2 Image Space
The pose uncertainty calculation assuming an image space minimization is presented first
because more computer vision algorithms operate in image space and the calculation itself
is easier. As stated earlier, the image point noise is assumed to be Gaussian with zero mean
and known variance,
$$\begin{bmatrix} n_u \\ n_v \end{bmatrix} \sim \begin{bmatrix} N(0, \sigma_u) \\ N(0, \sigma_v) \end{bmatrix} . \qquad (5.53)$$
If the noise components of a single image point are uncorrelated, the covariance may be written as
$$\mathrm{cov}(n) = \begin{bmatrix} \sigma_u^2 & 0 \\ 0 & \sigma_v^2 \end{bmatrix} . \qquad (5.54)$$
Furthermore, a large covariance matrix for the entire point set may be constructed from the smaller covariance matrices along the diagonal,
$$\mathrm{COV}(n) = \begin{bmatrix} \mathrm{cov}(n) & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \mathrm{cov}(n) \end{bmatrix} . \qquad (5.55)$$
The parameterized image space error function has two error components,
$$e_u^i(\Theta, n) = f_u \frac{r_1(\Theta)\, {}^{M}P_{m_i} + t_x(\Theta)}{r_3(\Theta)\, {}^{M}P_{m_i} + t_z(\Theta)} + u_0 + n_u - u_i, \qquad (5.56)$$
$$e_v^i(\Theta, n) = f_v \frac{r_2(\Theta)\, {}^{M}P_{m_i} + t_y(\Theta)}{r_3(\Theta)\, {}^{M}P_{m_i} + t_z(\Theta)} + v_0 + n_v - v_i . \qquad (5.57)$$
Because noise enters linearly into the error function, the Jacobians with respect to noise, $B_i$, equal identity and the pose error covariance equation 5.52 can be simplified to
$$\mathrm{COV}(\tilde{\Theta}) = C B\, \mathrm{COV}(n)\, B^T C^T = C I\, \mathrm{COV}(n)\, I^T C^T = C\, \mathrm{COV}(n)\, C^T . \qquad (5.58)$$
The Jacobian for a single image point with quaternion pose parameterization has the form
$$A_i = \begin{bmatrix} \frac{\partial e_u^i(\hat{\Theta},0)}{\partial q} & \frac{\partial e_u^i(\hat{\Theta},0)}{\partial t} \\ \frac{\partial e_v^i(\hat{\Theta},0)}{\partial q} & \frac{\partial e_v^i(\hat{\Theta},0)}{\partial t} \end{bmatrix} . \qquad (5.59)$$
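Numerically, the image-space case is a few lines once the Jacobians of equation 5.59 have been stacked into A; the sketch below (Eigen assumed) evaluates equation 5.58 directly.

```cpp
#include <Eigen/Dense>
using Eigen::MatrixXd;

// Image-space pose covariance (equation 5.58). A stacks the per-point
// Jacobians A_i of equation 5.59; covN is the block-diagonal COV(n) of
// equation 5.55. With B = I, COV = C * COV(n) * C^T, where C = (A^T A)^{-1} A^T.
MatrixXd poseCovarianceImageSpace(const MatrixXd& A, const MatrixXd& covN)
{
    MatrixXd C = (A.transpose() * A).inverse() * A.transpose();
    return C * covN * C.transpose();
}
```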
5.6.3 Object Space
The pose uncertainty calculation in object space is more complicated than in image space, but it is still desirable to have for algorithms such as the Orthogonal Iteration Algorithm of chapter four. Because noise enters nonlinearly into the error function through the projection matrix,
$$F_i(n) = \frac{(w_i + n)(w_i + n)^T}{(w_i + n)^T (w_i + n)}, \qquad (5.60)$$
the simplified pose error covariance calculation of image space does not apply. Recall that the projection matrix requires normalized image points, but defining a noise distribution within a full-size image is more intuitive. In other words, let the full-image noise distribution undergo a transformation,
$$n = \begin{bmatrix} \frac{N(0, \sigma_u)}{f_u} \\ \frac{N(0, \sigma_v)}{f_v} \end{bmatrix} . \qquad (5.61)$$
Fortunately, the new distribution is still Gaussian because the operation is linear, i.e. division by the focal length. Now the modified covariance matrix for an individual image point is written as
$$\mathrm{cov}(n) = \begin{bmatrix} \left(\frac{\sigma_u}{f_u}\right)^2 & 0 \\ 0 & \left(\frac{\sigma_v}{f_v}\right)^2 \end{bmatrix}, \qquad (5.62)$$
and the large covariance matrix of the entire point set is constructed as before. The parameterized object space error function has three components,
$$e_x^i(\Theta, n) = \left( I_1 - f_1^i(n) \right)\left( R(\Theta)\, {}^{M}P_{m_i} + t(\Theta) \right), \qquad (5.63)$$
$$e_y^i(\Theta, n) = \left( I_2 - f_2^i(n) \right)\left( R(\Theta)\, {}^{M}P_{m_i} + t(\Theta) \right), \qquad (5.64)$$
$$e_z^i(\Theta, n) = \left( I_3 - f_3^i(n) \right)\left( R(\Theta)\, {}^{M}P_{m_i} + t(\Theta) \right) . \qquad (5.65)$$
The Jacobians (pose and noise) for a single image point with quaternion pose parameterization have the form
$$A_i = \begin{bmatrix} \frac{\partial e_x^i(\hat{\Theta},0)}{\partial q} & \frac{\partial e_x^i(\hat{\Theta},0)}{\partial t} \\ \frac{\partial e_y^i(\hat{\Theta},0)}{\partial q} & \frac{\partial e_y^i(\hat{\Theta},0)}{\partial t} \\ \frac{\partial e_z^i(\hat{\Theta},0)}{\partial q} & \frac{\partial e_z^i(\hat{\Theta},0)}{\partial t} \end{bmatrix}, \qquad B_i = \begin{bmatrix} \frac{\partial e_x^i(\hat{\Theta},0)}{\partial n_u} & \frac{\partial e_x^i(\hat{\Theta},0)}{\partial n_v} \\ \frac{\partial e_y^i(\hat{\Theta},0)}{\partial n_u} & \frac{\partial e_y^i(\hat{\Theta},0)}{\partial n_v} \\ \frac{\partial e_z^i(\hat{\Theta},0)}{\partial n_u} & \frac{\partial e_z^i(\hat{\Theta},0)}{\partial n_v} \end{bmatrix} . \qquad (5.66)$$
5.6.4 Quaternion Constraint
The unit length constraint is a disadvantage of quaternions as a parameterization within
the pose uncertainty calculation. The constraint that the true quaternion is unit length,
kqk = 1, and that the error lies along the hypersphere must somehow be enforced. Recall
the minimization from the pose uncertainty derivation,
$$\min_{\tilde{\Theta}} \left\| e(\hat{\Theta}, 0) + A\tilde{\Theta} + B n \right\|^2 = \min_{\tilde{\Theta}} \left\| y - A\tilde{\Theta} \right\|^2 . \qquad (5.67)$$
Forcing the error to be along the hypersphere cannot be achieved through a linear operation. Instead, the constraint is relaxed so that the error lies in a hyperplane which is tangent at the point of the estimated quaternion. See figure 5.3.

Figure 5.3: Hyperplane Tangent to Hypersphere

This is equivalent to saying that the error vector must be orthogonal to the estimated quaternion, and therefore the dot product should equal zero, $\hat{q} \cdot (q - \hat{q}) = 0$. The minimization is rewritten to include the dot product constraint,
$$\min_{\tilde{\Theta}} \left\| y - A\tilde{\Theta} \right\|^2 \quad \text{such that} \quad D\tilde{\Theta} = 0, \qquad (5.68)$$
where
$$D = \begin{bmatrix} \hat{q}_0 & \hat{q}_x & \hat{q}_y & \hat{q}_z & 0 & 0 & 0 \end{bmatrix} . \qquad (5.69)$$
This constraint forces the error vector into the null space of D. The translation elements are free dimensions because they are not part of the dot product or orthogonality constraint that defines the hyperplane in 4-dimensional space.
$$N = \mathrm{Null}(D) = \{ x \in \mathbb{R}^n : D x = 0 \} \qquad (5.70)$$
$$D N x = 0 \quad \forall x \qquad (5.71)$$
The Householder transformation is a very quick way to calculate the null space of a vector. The minimization is written again to include the linear null-space constraint:
$$\min_{x} \| y - A N x \|^2 \qquad (5.72)$$
$$\hat{x} = \left( (A N)^T (A N) \right)^{-1} (A N)^T y \qquad (5.73)$$
Finally, the pose covariance equation is modified slightly to include the quaternion constraint, i.e.
$$C = N \left( (A N)^T (A N) \right)^{-1} (A N)^T . \qquad (5.74)$$
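The constrained matrix C of equation 5.74 can be formed as sketched below. For brevity, the null-space basis is taken from a full-pivoting LU decomposition in Eigen rather than the Householder transformation mentioned above; both yield a valid basis of Null(D). The pose covariance then follows from equation 5.52 with this C.

```cpp
#include <Eigen/Dense>
using Eigen::MatrixXd;

// Quaternion-constrained C of equation 5.74. D = [q0 qx qy qz 0 0 0] is the
// 1x7 constraint row; A stacks the per-point Jacobians. N spans Null(D), so
// the quaternion error is confined to the tangent hyperplane of figure 5.3.
MatrixXd constrainedC(const MatrixXd& A, const MatrixXd& D)
{
    MatrixXd N = D.fullPivLu().kernel();                 // 7x6 null-space basis
    MatrixXd AN = A * N;
    return N * (AN.transpose() * AN).inverse() * AN.transpose();
}
```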
Chapter 6
Real-time System
The goal of this chapter is to show how all of the concepts covered previously are connected into a functional real-time augmented reality system. Brief hardware and software
descriptions are provided first. Then, important software subsystems are covered in more
detail.
6.1 Hardware Description
The hardware design of the system is very simple because off-the-shelf components could be
used. In other words, no custom hardware was needed. See figure 6.1 for a complete list.
The Prosilica GC1020C and Crista IMU are the only sensor inputs to the system. A gigabit
ethernet camera was chosen because of its ability to reliably stream massive amounts of video
data over long cable distance. Powerful on-board processing converts raw Bayer images to
8-bit RGB and then transmits them as very large 9K packets, “Jumbo frames”. A camera
of this type places minimal burden on the workstation allowing resources to be dedicated to
other processes. A wide-angle lens, ∼ 70◦ , is used so that cones remain in the field of view
as often as possible. The Crista IMU was chosen because of its low cost, standard RS232
interface, and prior experience with the product. A mid-range workstation with dual-core Xeon processors and 2 GB of RAM required only minor modification: replacement of the graphics and Ethernet cards.
Prosilica GC1020C | Gigabit Ethernet Camera, 1024x768, 30 fps, 1/3" CCD | $1200
Crista IMU | MEMS Accelerometer and Gyroscope, 3-Axis, 100 Hz, Serial Interface | $1300
Dell Precision 690 | Dual Core Xeon 5130 2 GHz, 4 MB L2 Cache, 1333 MHz, 2 GB DDR2 ECC SDRAM Memory 667 MHz | $1300
Nvidia 8800GTS | Mid-range Graphics Card / GPU | $200
Xenarc 840YV | 8.4" LCD, 1024x768 | $200
Kowa LM4NCL | Wide-Angle C-mount Lens, 3.5 mm F1.4 | $100
Intel PWLA8391GT | Gigabit Ethernet Card, Jumbo Frames Supported, 9K Packet Size | $25
4" Traffic Cones (x3) | Fiducials, Larger Size Could Easily Be Used | $10

Figure 6.1: Hardware Components & Sensors

Figure 6.2 shows the sensor system with the LCD attached by a custom mounting bracket that was manufactured within a couple of hours. The sensors and display are all rigidly
attached. To reduce the induced moment of the IMU about the camera, the sensors are
placed as closely together as possible, less than 6cm. The LCD is aligned with the camera
so that the relationship between the two coordinate frames is merely a small translation in
the negative z direction, along the camera’s optical axis. To support the choice of a video
see-through LCD, a short discussion of augmented reality display options is provided next.
As stated in chapter one, basically two different display technologies are available
for use in AR, optical and video see-through. Optical see-through allows the user to see
the real world directly. Graphics are projected into the user’s field of view with an optical
combiner, such as half-silvered mirrors. Video see-through does not let the user see the real
world directly. Instead, a video stream is blended with graphics and displayed to the user.
Typically, head-mounted displays (HMD) are used but a handheld LCD is another option.
Any display will have advantages and disadvantages but the choice of technology typically
depends on the application requirements.
Optical blending is less computationally expensive than video blending because only
virtual objects must be rendered. Also, the video stream may be considerably warped by
lens distortion, and the focal length will likely not be equivalent to that of the human eye.
Video see-through is also limited by the physical resolution of the camera, and artifacts may be introduced by compression.
Optical combiners reduce the amount of light from the real world that reaches the
Figure 6.2: Sensors & LCD
eye, somewhere around 30%. As a result, virtual objects do not completely obscure real-world objects because light is combined from both real and virtual sources. In high ambient
light environments, virtual graphics will appear opaque or may even vanish completely.
Optical see-through offers an instantaneous view of the real world, but the virtual objects will inherently be delayed by tracking and rendering delays. Video see-through has the advantage of matching real and virtual graphics in software when delays are present.
In addition to estimating the pose of the display itself, optical see-through requires an extra calibration step and eye tracking for accurate registration. The calibration, known as occultation, models the distance of the eyes relative to each other and the display. Similarly, in video see-through, a large separation between camera and display can create perceptual discomfort, but the mind is typically able to seamlessly compensate for small disparity. Finally, video see-through has complete control over what the user sees, at the pixel level,
and may ultimately produce more compelling environments. If a camera is already present
for tracking purposes, it makes sense to use it within a video see-through display.
Pose Estimation & AR (CSM-PEARL) | Custom Testbed Library
Open Computer Vision (OpenCV) | Image Processing & Computer Vision Library | Open Source
Open Scene Graph (OSG) | Performance 3D Graphics Library | Open Source
wxWidgets | Cross-Platform GUI Library | Open Source
Prosilica Gigabit Ethernet (PvAPI) | Ethernet Camera Interface | Free
Cloudcap Serial-IMU (CSI) | Serial IMU Interface | Open Source
Integrated Performance Primitives (IPP) | Optimized Functions for Intel Chipsets | Free*

Figure 6.3: Libraries / Dependencies
6.2 Software Description
Unlike the hardware design, the software architecture design is much more complicated.
The seven libraries shown in figure 6.3 were needed. Fortunately, all of the libraries are
either open source or free for educational use. A large code base (CSM-PEARL) was written
specifically for the testbed and is provided on-line for download (control.mines.edu).
Figure 6.4 shows an overall view of the software architecture. Data from the two
sensors is processed and placed into a circular buffer. The EKF pulls measurements from
the buffer and produces a filtered pose estimate. Graphics are generated by blending the
most recent pose estimate and camera image. A GUI allows control of sensors, EKF filter
parameters, and various rendering options. A real-time system like this will inevitably require
parallel processing. Figure 6.5 lists the threads that were required.
Figure 6.4: Full System

Camera | Color Classification & Buffering
Vision | Segmentation, Pose Estimation, & Buffering
IMU | Data Processing & Buffering
EKF | Sensor Fusion & Recursive Estimation
Graphics | Virtual Model Update & Data Logging
GUI | User Control of Sensors & Estimation

Figure 6.5: Threads

6.3 Measurement Buffer
The measurement buffer, figure 6.6, is an essential subsystem which does much more than simply buffer measurements. It coordinates the activity of the sensor threads and the EKF thread. The workstation is powerful enough that the buffer never becomes completely full. Measurements and time-stamps from both sensor sources are placed into a set of nodes, figure 6.7.

Figure 6.6: Measurement Buffer
Figure 6.7: Measurement Node

The circular buffer is very similar in structure to a standard linked list with a few extra features.
As soon as new sensor data begins to be processed, the current write node is set to pending status and a time-stamp field is recorded. When processing is completed, the final
measurement is written and pending status is dropped. The pending concept is extremely
important here because the sensors operate asynchronously and the camera processing time is
much longer than the IMU. Consequently, IMU measurements are allowed to “pile up” behind
a camera measurement in the buffer. Typically, the EKF waits a short time for a camera
measurement to be completed (no longer pending) and then all of the IMU measurements
behind it are quickly processed afterwards.
The measurement type tells the EKF which data structures within a node to use, either camera or IMU. Additionally, the type may be set to null so that if a pending measurement fails to yield a valid measurement, such as when no fiducials are in the field of view, the node is still completed. When the EKF encounters such a node, it simply ignores it and quickly moves on to read the next node.
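An illustrative layout of one buffer node is sketched below in C++; the field names are hypothetical and not taken from the CSM-PEARL source, but they capture the time-stamp, pending flag, and measurement-type fields described above.

```cpp
// Hypothetical layout of a node in the circular measurement buffer
// (figures 6.6 and 6.7).
enum class MeasType { Null, Camera, Imu };

struct MeasurementNode {
    double   timestamp = 0.0;            // recorded when processing starts
    bool     pending   = true;           // cleared when the measurement is final
    MeasType type      = MeasType::Null; // Null => EKF skips this node

    // Camera measurement: pose (quaternion + position) and its 7x7 covariance.
    double q[4] = {1.0, 0.0, 0.0, 0.0};
    double p[3] = {0.0, 0.0, 0.0};
    double covCam[7 * 7] = {0.0};

    // IMU measurement: angular velocity and acceleration.
    double gyro[3]  = {0.0, 0.0, 0.0};
    double accel[3] = {0.0, 0.0, 0.0};

    MeasurementNode* next = nullptr;     // circular link to the following node
};
```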
Figure 6.8: Vision Thread

6.4 Vision Thread
The vision thread is the most demanding of all threads, figure 6.8. It receives a binary image
from the camera buffer and produces a pose estimate with associated uncertainty. Cones
are segmented according to the procedure outlined in chapter two. Given a predicted pose
from the EKF, the model point locations are corrected, and then correspondence is found
between the 2D-3D set. The non-iterative 3-Simplex and Orthogonal Iteration algorithms
of chapter four generate a pose estimate. The 3-simplex algorithm is not used when a valid
initial guess from the EKF is available. The object space uncertainty algorithm of chapter
five generates a new covariance matrix at each iteration.
Before accepting a pose estimate as valid, two constraints are imposed. The goal of the constraints is to reject vision solutions that did not converge properly to the global minimum. This could happen for three reasons. First, the segmented image points may be extremely noisy, or a cone could have been falsely detected. Second, the 2D-3D correspondence may not be correct. Last, the algorithm may have been provided a bad initial guess far outside the neighborhood of the global minimum. The camera uncertainty calculation assumes that an error function has been minimized, and it is therefore important to detect outliers that violate the assumption. If an outlier is not caught, the uncertainty will be severely underestimated and the EKF will trust the measurement much more than it should.
6.4.1 Image Space Error Residual Constraint
Recall the form of the image space error function 4.1 as a sum of squared 2-dimensional errors on the image plane. Assume the individual errors of each image point are iid Gaussian random variables, $e_u^i \sim N(0,\sigma_u)$ and $e_v^i \sim N(0,\sigma_v)$. Therefore, equation 4.1 can be modified slightly into the form of the unit-variance chi-square distribution,
$$\chi_k^2 = \sum_{i=1}^{k} \frac{(\hat{x} - x_i)^2}{\sigma_x^2}, \qquad (6.1)$$
$$x = \sum_{i=1}^{n} \frac{(\hat{u}_i - u_i)^2}{\sigma_u^2} + \frac{(\hat{v}_i - v_i)^2}{\sigma_v^2} = \sum_{i=1}^{n} \frac{(e_u^i)^2}{\sigma_u^2} + \frac{(e_v^i)^2}{\sigma_v^2}, \qquad k = 2n . \qquad (6.2)$$
A threshold probability and the chi-square CDF define acceptance and rejection regions on the residual within a binary hypothesis test. In practice, a lookup table of the CDF is used because the gamma functions, $\gamma(k/2, x/2)$ and $\Gamma(k/2)$, are difficult to calculate. See the appendix for more information.
$$F_\chi(x, k) = \frac{\gamma(k/2, x/2)}{\Gamma(k/2)} \qquad (6.3)$$
$$\mathrm{Constraint} = \begin{cases} \text{Accept}, & F_\chi(x, k) \le \text{threshold} \\ \text{Reject}, & \text{otherwise} \end{cases} \qquad (6.4)$$
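Because k = 2n is always even here, the chi-square CDF actually has a closed form that avoids both the incomplete gamma function and the lookup table; the sketch below uses that identity. This is an alternative to the table-based test described above, not the thesis implementation.

```cpp
#include <cmath>

// Chi-square CDF for even degrees of freedom k = 2n (equation 6.3). For even k,
//   F(x; k) = 1 - exp(-x/2) * sum_{i=0}^{k/2 - 1} (x/2)^i / i!
double chiSquareCdfEven(double x, int k)
{
    const double half = 0.5 * x;
    double term = 1.0, sum = 1.0;
    for (int i = 1; i < k / 2; ++i) {
        term *= half / i;   // (x/2)^i / i! built up incrementally
        sum += term;
    }
    return 1.0 - std::exp(-half) * sum;
}

// Residual constraint of equation 6.4: accept the vision solution only if
// the normalized image-space residual is not improbably large.
bool acceptResidual(double chi2Residual, int numPoints, double threshold)
{
    return chiSquareCdfEven(chi2Residual, 2 * numPoints) <= threshold;
}
```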
6.4.2 Relative Change in Motion Constraint
Human motion is restricted to small changes in position and orientation over a short time
interval, in our case ∆t < 10ms. Assuming the predicted pose of the EKF is nearly correct,
a relative change in motion from the new vision solution may be quickly computed. The
model frame remains fixed in time while the camera frame is likely moving.
$$^{C1}_{M}H = {}^{C}_{M}H_{k|k-1}, \qquad {}^{C2}_{M}H = {}^{C}_{M}H_{k|k} \qquad (6.5)$$
$$^{C1}_{C2}H = {}^{C1}_{M}H \left( {}^{C2}_{M}H \right)^{-1} \;\rightarrow\; \left\{ \hat{k}, \theta, t \right\} \qquad (6.6)$$
The rotation matrix is decomposed into the angle-axis representation; the axis is ignored. Therefore, relative motion may be described by a single angle and distance. Thresholds are placed on these values to define acceptance and rejection regions on the motion.

$$\mathrm{Constraint} = \begin{cases} \text{Accept}, & |\theta| \le \theta_{thresh} \;\cap\; \|t\| \le t_{thresh} \\ \text{Reject}, & \text{otherwise} \end{cases} \qquad (6.7)$$
6.5 Graphics Generation
Graphics generation is relatively simple and occurs in two stages, figure 6.9. A background
image from the camera is rendered first, and then the 3D model and pose information are
rendered on top. A few different rendering strategies were tested before deciding on the
one that worked best within our system. First, rendering was performed after all EKF
Figure 6.9: Graphics Generation
Figure 6.10: C-130 Hercules
measurement updates. Then, rendering was reduced to occur only after an IMU update.
Finally, a constant rendering rate was adopted which grabs the latest EKF estimate every
25ms. The last option looks the most visually appealing while simultaneously reducing the
burden placed on the workstation to a frame rate just above what is humanly detectable.
No effort was needed to match camera images and EKF estimates because a delay of one or
two frames is not detectable.
Three virtual 3D models were tested within the system. In theory, any of the many
3D file formats supported by Open Scene Graph could be used. However, all of our models are natively in VRML 2.0 format (.wrl) and were converted to the Wavefront format (.obj).
The C130 Hercules of figure 6.10 was used within the military training application discussed
in chapter one.
All six joints of the Kuka manipulator shown in figure 6.11 can be manually adjusted
Figure 6.11: Kuka KR6
with a slide bar to control the pose of the gripper. The idea here is that manipulator
kinematics can be taught to students by interactively controlling the arm and seeing it
move immediately in 3-dimensional space. Additionally, this capability might be helpful for
teleoperation of a mobile manipulator within a very sensitive work area where any action
must be carefully planned and verified.
Finally, the Matlab peaks plot of figure 6.12 demonstrates the capability to visualize
complex 3D data sets. VRML files can be generated from any Matlab plot. It is much more
meaningful and easier to comprehend such data when viewed in 3D instead of merely a flat
image on a computer screen.
Figure 6.12: Peaks Plot
Chapter 7
Experiments
At this point in the thesis, all background material has been covered that is necessary to
understand how the augmented reality system works, and it is now time to show some experimental results. First, various computer vision algorithms (chapter four) are compared
through simulation. Next, a Monte Carlo method is used to verify the pose uncertainty calculations (chapter five). Then, a new method for testing overlay accuracy of virtual graphics
is developed. Finally, results of a “realism” questionnaire completed by CSM students are
presented.
7.1 Computer Vision Algorithms
Simulations are used to compare the performance of various computer vision algorithms of
chapter four. Performance metrics include position and orientation error, image and object
space error, number of outliers, and run time. Each data set contains 500 independent
simulations. To make the simulations as realistic as possible, data is generated to match what we expect to encounter in our system, but the results may be generalized to a certain extent.
The effective focal lengths of our camera, fu = 770, fv = 776, are used to project
3D points onto the image plane, 1024x768, and then Gaussian noise is added to the image
points. At each iteration, a new pose and point cloud are generated so that the points lie
approximately 5’ in front of the camera with 1’ spread. The algorithms are tested with point
clouds containing either 10 or 25 points.
Results of the computer vision simulations are shown in the form of a notched box-and-whisker plot. The box plot is a useful way to display differences between data sets without making any assumption about the underlying statistical distribution. The spacing between the different parts of the box helps indicate the degree of dispersion (spread) and skewness in the data, and identifies outliers.
A box has lines drawn at the lower quartile (Q1), median (Q2), and upper quartile
(Q3). Whiskers show the extent of the rest of the data. They extend to the most extreme
data points that are within 1.5 times the interquartile range, Q3-Q1. Data points beyond
the ends of the whiskers are considered outliers. Instead of plotting the location of outliers,
the total number is placed at the end of the whiskers (non-standard).
In a notched box plot the notches represent a robust estimate of the uncertainty about
the medians for box-to-box comparison. Boxes whose notches do not overlap indicate that
the medians of the two groups differ at the 5% significance level. In other words, you can
conclude, with 95% confidence, that the true medians do differ.
7.1.1 Overall Algorithm Comparison
Group 1
1. DLT
2. MDLT (IS)
3. MDLT (OS)
4. 3-Simplex (IS)
5. 3-Simplex (OS)
Group 2
1. EPnP (v1)
2. EPnP (v2)
3. 3-Simplex SDP (IS)
4. 3-Simplex SDP (OS)
Group 3
1. MDLT + OIA (IS)
2. 3-Simplex + OIA (IS)
3. MDLT + Gauss-Newton (IS)
4. 3-Simplex + Gauss-Newton (IS)
Group 4
1. MDLT + OIA (OS)
2. 3-Simplex + OIA (OS)
3. MDLT + Gauss-Newton (OS)
4. 3-Simplex + Gauss-Newton (OS)
Figure 7.1: Computer Vision Algorithm Test Groups
As shown in figure 7.1, the computer vision algorithms are separated into four groups. The
grouping makes it easier to refer to algorithms during the discussion. The abbreviations
IS and OS mean image and object space, respectively. Group 1 contains five non-iterative
algorithms. Group 2 contains three “eigenvector algorithms”, including two from [27] and
a new SDP. Groups 3 & 4 contain algorithm combinations which use either the Orthogonal Iteration Algorithm (OIA) or Gauss-Newton to refine a pose estimate iteratively. The
algorithms of groups 3 & 4 operate in either image or object space, respectively.
All of the groups were tested separately with either 10 or 25 points and 5 or 10 pixel
error. Box plots of orientation error, position error, and runtime are located in the appendix.
See figures 9.1 through 9.8.
When inspecting the results from all of the groups, one of the first observations to
make is that pose error increases with image noise, but with more points the algorithm is
less sensitive to noise. The second observation is that reducing an image or object space
error yields what seem to be equivalent results, when point depth has small variation.
When inspecting group 1 results, the DLT obviously performs much worse than the
other four algorithms. It appears as though the MDLT and 3-Simplex work equally as well
within a linear algorithm. However, runtime of the simplex algorithms is slightly longer than
the MDLT. Similarly, the runtime of the object space algorithms is slightly longer than the
image space algorithms.
When comparing group 2 against group 1, the results are better. The linear eigenvector method (EPnP v1) shows some improvement. A slightly higher gain is seen with
the Gauss-Newton method (EPnP v2), and the most improvement is seen with the new
3-Simplex SDP. However, the runtime of the simplex algorithm is quite terrible. Converting the problem into a sum-of-squares introduces complexity but allows the use of well-known proofs of convexity and global convergence.
for an iterative algorithm might be more important than runtime.
When inspecting groups 3 & 4 the results of both groups look very similar. Recall
that the only difference in the groups was the error space where the minimization occurred.
It is more interesting to notice that the OIA and Gauss-Newton algorithms produce almost
identical results at the two noise levels tested. Additionally, the OIA is much slower than
the Gauss-Newton. When the number of points is increased, the added runtime to the OIA
is noticeable as opposed to the Gauss-Newton which stays about the same. This is because
of the data reduction within the “reduced measurement matrices”.
7.1.2 Best Algorithm Combinations
Group 5
1. 3-Simplex + OIA (IS)
2. 3-Simplex + OIA (OS)
3. 3-Simplex + Gauss-Newton (IS)
4. 3-Simplex + Gauss-Newton (OS)
Figure 7.2: “Best Algorithm Combinations”
In our pose estimation system, an initial guess is always taken from the EKF, except when tracking has been lost or during initialization at startup. In the latter cases, an initial guess must be taken from an image. For this reason, it is necessary to have an algorithm
combination that can generate an initial guess and then (hopefully) iterate toward the global
minima. Group 5, figure 7.2, consists of the “best algorithm combinations” based on the
results above. A question posed earlier was whether or not it matters from which error
space an initial guess is given to the OIA. Testing of group 5 addresses this question by
the use of the linear simplex method in both image and object error spaces. The 3-Simplex
Gauss-Newton is compared against the OIA.
Group 5 is tested at much higher noise levels than in the section above. See figure
9.9 in the appendix. The results indicate that pose error and runtime of the OIA are not
significantly affected by initial guesses from different error spaces. It was already shown that
at low noise levels the two iterative algorithms produce equivalent results, but the Gauss-Newton is much faster. However, as shown in the results by a greater number of outliers,
the Gauss-Newton algorithm tends to converge less successfully than the OIA when noise
is increased. At σuv = 25 the results still look identical. At σuv = 50 convergence of the
Gauss-Newton has begun to worsen, and then at σuv = 75 the Gauss-Newton is clearly much
worse than the OIA. It is important to note that these noise levels are much higher than we would ever expect to encounter, but they illustrate the point that the OIA has very good
global convergence properties. In other words, in our system the OIA proof of convergence
was more important than the asymptotic efficiency of the Gauss-Newton.
7.1.3 Camera Uncertainty
To verify the pose uncertainty calculations of chapter five, a Monte Carlo method is used.
Recall that the uncertainty is equivalent to the variance of the error between the true and
estimated pose. If the uncertainty calculation performs well, it should predict a covariance
equal to the covariance obtained through repeated sampling in the Monte Carlo simulations.
In the computer algorithm simulations of the last section, completely new data is
generated at each iteration. Here, the same pose and 3D point set are retained for the
duration of the simulation and only random image point noise is generated at each trial.
The number of trials needed to approximate the true covariance was found by inspecting the
variance of the data and seeing that it converged around 1000 trials.
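The Monte Carlo reference covariance is just the sample covariance of the stored pose errors; a minimal sketch (Eigen assumed) is shown below.

```cpp
#include <Eigen/Dense>
#include <vector>
using Eigen::MatrixXd;
using Eigen::VectorXd;

// Sample covariance of the pose error over repeated Monte Carlo trials.
// Each entry of `errors` is the 6- or 7-vector difference between the
// estimated and true pose for one noisy trial (~1000 trials here).
MatrixXd sampleCovariance(const std::vector<VectorXd>& errors)
{
    const long n = static_cast<long>(errors.size());
    const long d = errors.front().size();

    VectorXd mean = VectorXd::Zero(d);
    for (const auto& e : errors) mean += e;
    mean /= static_cast<double>(n);

    MatrixXd cov = MatrixXd::Zero(d, d);
    for (const auto& e : errors) {
        VectorXd c = e - mean;
        cov += c * c.transpose();
    }
    return cov / static_cast<double>(n - 1);
}
```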
Nonlinear least-squares computer vision algorithms with Euler angle parameterization minimize a residual in image or object space. The algorithms are provided the true pose as an initial guess so that they always converge properly to the global minimum. Recall from chapter five that the uncertainty calculation assumes an error function has been minimized to the global minimum; hence, it converged properly.
All simulations use a set of 25 model points. The results are shown in figures 7.3
through 7.6. Each table contains actual and estimated covariances for three different trials.
The first, second, and third trials have 5, 7.5, and 10 pixel noise, respectively. All four figures
indicate that the uncertainty calculations do in fact accurately estimate covariance.
Trial, σ |  | α | β | γ | tx | ty | tz
1, 5.0 | Estimated | 0.0469 | 0.0212 | 0.0439 | 5.7136 | 5.2124 | 6.6508
1, 5.0 | Actual | 0.0487 | 0.0218 | 0.0443 | 5.7620 | 5.2909 | 6.9735
2, 7.5 | Estimated | 0.0220 | 0.0179 | 0.0168 | 4.7803 | 6.3891 | 6.2329
2, 7.5 | Actual | 0.0222 | 0.0168 | 0.0158 | 5.0047 | 6.1433 | 6.2856
3, 10 | Estimated | 0.0396 | 0.0223 | 0.0378 | 7.7118 | 8.6137 | 9.8118
3, 10 | Actual | 0.0303 | 0.0231 | 0.0261 | 6.9733 | 6.0114 | 7.4261

Figure 7.3: Camera Uncertainty, Image Space, Euler Angles
Trial, σ |  | q0 | qx | qy | qz | tx | ty | tz
1, 5.0 | Estimated | 0.0077 | 0.0101 | 0.0101 | 0.0077 | 5.7136 | 5.2124 | 6.6508
1, 5.0 | Actual | 0.0075 | 0.0103 | 0.0105 | 0.0072 | 5.7620 | 5.2909 | 6.9735
2, 7.5 | Estimated | 0.0097 | 0.0057 | 0.0056 | 0.0101 | 4.7803 | 6.3891 | 6.2329
2, 7.5 | Actual | 0.0098 | 0.0050 | 0.0052 | 0.0099 | 5.0047 | 6.1433 | 6.2856
3, 10 | Estimated | 0.0135 | 0.0115 | 0.0130 | 0.0132 | 7.7118 | 8.6137 | 9.8118
3, 10 | Actual | 0.0109 | 0.0083 | 0.0115 | 0.0106 | 6.9733 | 6.0114 | 7.4261

Figure 7.4: Camera Uncertainty, Image Space, Quaternion
Trial, σ |  | α | β | γ | tx | ty | tz
1, 5.0 | Estimated | 0.0475 | 0.0217 | 0.0447 | 5.8582 | 5.2913 | 6.7576
1, 5.0 | Actual | 0.0486 | 0.0213 | 0.0459 | 5.7363 | 5.2787 | 6.6861
2, 7.5 | Estimated | 0.0222 | 0.0181 | 0.0169 | 4.8155 | 6.4386 | 6.2976
2, 7.5 | Actual | 0.0224 | 0.0164 | 0.0158 | 4.9306 | 5.9908 | 6.3086
3, 10 | Estimated | 0.0403 | 0.0224 | 0.0385 | 7.7852 | 8.7470 | 9.9881
3, 10 | Actual | 0.0296 | 0.0234 | 0.0284 | 7.1507 | 6.2901 | 7.0801

Figure 7.5: Camera Uncertainty, Object Space, Euler Angles
Trial, σ |  | q0 | qx | qy | qz | tx | ty | tz
1, 5.0 | Estimated | 0.0079 | 0.0103 | 0.0104 | 0.0079 | 5.8579 | 5.2916 | 6.7578
1, 5.0 | Actual | 0.0079 | 0.0102 | 0.0101 | 0.0076 | 5.7363 | 5.2787 | 6.6861
2, 7.5 | Estimated | 0.0098 | 0.0058 | 0.0056 | 0.0102 | 4.8153 | 6.4386 | 6.2969
2, 7.5 | Actual | 0.0096 | 0.0050 | 0.0053 | 0.0099 | 4.9306 | 5.9908 | 6.3086
3, 10 | Estimated | 0.0136 | 0.0117 | 0.0133 | 0.0135 | 7.7862 | 8.7453 | 9.9881
3, 10 | Actual | 0.0114 | 0.0084 | 0.0114 | 0.0111 | 7.1507 | 6.2901 | 7.0801

Figure 7.6: Camera Uncertainty, Object Space, Quaternion
Figure 7.7: Overlay Test Platform
7.2 Overlay Test Platform
Unlike other applications where absolute position and orientation error are important, the
success of a pose estimation system within AR may be measured by overlay accuracy. In
other words, it is only important to analyze how “good” the overlay looks to a user. Goodness
is defined by the degree of proper alignment of virtual graphics with the real world. For this reason, an overlay test platform, figure 7.7, was created to clearly indicate misalignment.
The grid of red dots represents a virtual object which should be aligned with the
grid of blue dots. In real-time, the screen of the video see-through display is recorded to
file for off-line analysis. Using basic image processing techniques, the red and blue dots are
automatically segmented within all video frames. Misalignment occurs when a red dot does
not lie exactly in the middle of four blue dots. An error in pixels is related to an actual
distance error by a simple scale factor.
The results of figures 7.8 and 7.9 show that the alignment error has about a 1.5" standard deviation. Additionally, the results indicate that the mean of the error is nonzero and therefore a small bias or miscalibration must exist.
Trial # | µx | µy | σx | σy
1 | 0.6719 | -0.5526 | 1.3154 | 1.5251
2 | 0.5030 | -0.4178 | 1.4680 | 1.2923
3 | 0.6635 | -0.3707 | 1.3901 | 1.4203

Figure 7.8: Overlay Accuracy, without Gyros (Inches)
Trial # | µx | µy | σx | σy
1 | 0.3352 | -0.3235 | 0.8382 | 1.0786
2 | 0.4105 | -0.2114 | 1.2336 | 1.3096
3 | 0.4591 | -0.3156 | 1.2451 | 1.1880

Figure 7.9: Overlay Accuracy, with Gyros (Inches)
7.3 Human Testing - Realism
Ultimately, for an augmented reality system to be successful users must perceive the integration of virtual graphics with the real-world to be as realistic as possible. In addition to correct
registration, the fidelity of virtual objects themselves contributes to the realism. For example, a 3D model might appear “blocky” or coarse due to low polygon count, or its brightness
and shadowing might not match the real environment. Furthermore, the AR display might
have low refresh rate, resolution, or luminosity. All of these factors and more may negatively
affect the illusion/realism of virtual objects: jitter, drift/alignment, lag/delay, unnatural,
2-dimensional.
For these reasons a realism questionnaire was completed by CSM students who used
the AR system. It is partly based on the work of [32] which measures three different human
factors components. “Realness” relates to how real the virtual objects seemed and how well
they integrated with the real objects. “Spatial presence” relates to the 3D-ness of the virtual
objects. “Perceptual Stress” taps experiences of the perceptual process itself - whether the
difference between real and virtual drew attention, and whether the perception of the virtual
object needed effort.
The participants answered realism questions with the gyros enabled and disabled,
columns two and three of figure 7.10. All verbal feedback indicated the need for the IMU.
Users stated that without the gyros they found it very distracting when the cones were not in view, because the pose of the virtual model stopped updating altogether.
Question | With Gyros | Without Gyros
I had the impression that virtual objects fit naturally into the real world? | 89.4608 | 83.3103
I had the impression that I could have touched and grasped the virtual object? | 86.3899 | 82.1588
I had the impression that virtual objects were located in 3-dimensional space, not just appearing as flat images on the screen? | 91.0677 | 89.1395
Small distractions did not draw my attention away from the integration of virtual objects into the real world? | 87.1483 | 82.9315
I did not have to make an effort to recognize the virtual objects as being three-dimensional? | 96.4061 | 95.6133

Figure 7.10: Realism Questionnaire
Chapter 8
Summary and Conclusions
In summary, a new augmented reality system was constructed from low cost off-the-shelf
components. An 8.4” LCD demonstrates handheld video see-through technology. All software was written in standard C++ with free or open source libraries. Three different 3D
models exhibit AR capabilities within four application areas: C130 Hercules - training, Kuka
manipulator - education and teleoperation support, Matlab peaks plot - 3D data visualization.
An original vision and inertial pose estimation system was designed and implemented.
The system relies on input from two sensors; camera updates occur at 30Hz, and IMU updates
at 100 Hz. A new calibration routine was created for relative sensor orientation using gravity vectors, extrinsic camera parameters, and an orthogonal matrix method.
Traffic cones are used as 3D fiducials. The system is scalable because any size cone can
be used, 4” to 36”. The use of 3D fiducials creates a much larger work volume than planar
targets. The cones are also inexpensive, durable, stackable, and may be quickly placed for
outdoor use.
A quaternion-based EKF was implemented to filter sensor measurements and avoid
singularities found in other rotation parameterizations. A graphical user interface allows filter
parameters to be tuned during real-time operation. Additionally, sensors may be enabled
and disabled.
A new object space pose uncertainty algorithm with quaternion constraint was designed and verified with a Monte Carlo method. This method is needed for systems that
rely on a quaternion-based estimator and object space algorithms such as the Orthogonal
Iteration Algorithm.
Several new computer vision algorithms were created and tested in Matlab. The new
algorithms were compared against popular existing algorithms. Results show several linear
techniques that outperform the DLT method. Gauss-Newton algorithms were shown to be
much faster than the Orthogonal Iteration Algorithm but occasionally fail to converge. The
3-Simplex SDP shows promising results, but further analysis is needed to conclusively state
its convergence properties.
A new 6-DOF overlay test platform was created for measuring and analyzing overlay
error in the end-to-end system. Virtual and real features are segmented from a video stream
to measure alignment error. Results clearly indicate low static and dynamic registration
error.
Students completed a “realism” questionnaire, with and without the gyros enabled.
Responses indicated the need for the IMU at times when fiducials are not visible. The
responses also helped indicate areas for improvement.
It is inevitable that the user will look away from the fiducials. During these times a
complementary sensor is needed so that registration error does not grow to the point where
the user is distracted. The next-generation system should have a lower-noise IMU so that
accelerometers can be integrated into the EKF for updating translation.
To reduce the motion restrictions caused by data and power cables between the sensor/LCD head and the workstation, all computing and power should be moved into a backpack
and run from a laptop. The quality of the video see-through display could be enhanced by undistorting the entire video stream instead of only the feature points.
Cone segmentation accuracy could likely be improved by using a more complex color
classification method and implementing a predictive template matching scheme. Line detection might also help locate cones within an image.
The 3-Simplex SDP algorithm should be studied more carefully to understand a few
key steps including optimal selection of simplex vertices, optimal combination of distance
and dot product constraints, and how to choose the number of eigenvectors.
Bibliography
[1] F. Ababsa, M. Mallem, and D. Roussel. Comparison between particle filter approach
and Kalman filter-based technique for head tracking in augmented reality systems, 2004.
[2] Y. Abdel-Aziz and H. Karara. Direct linear transformation from comparator coordinates into
object space coordinates in close-range photogrammetry, 1978.
[3] A. Ansar and K. Daniilidis. Linear pose estimation from points or lines, 2003.
[4] R. Azuma. A survey of augmented reality, 1997.
[5] R. Azuma and et al. Recent advances in augmented reality, 2001.
[6] J. Chou. Quaternion kinematic and dynamic differential equations, 1992.
[7] D. Choukroun, Y. Oshman, and I. Bar-Itzhack. Novel quaternion Kalman filter, 2006.
[8] J. Craig. Introduction to robotics: Mechanics and control, 2004.
[9] E. Dam, M. Koch, and M. Lillholm. Quaternions, interpolation, and animation, 1998.
[10] D. Drascic and P. Milgram. Perceptual issues in augmented reality, 1996.
[11] D. Eberly. Rotation representations and performance issues, 2002.
[12] P. Fiore. Efficient linear solution of exterior orientation, 2001.
[13] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting
with applications to image analysis and automated cartography, 1981.
[14] R. Hartley. Minimizing algebraic error in geometric estimation problems, 1998.
[15] J. Heikkila and O. Silven. Calibration procedure for short focal length off-the-shelf CCD
cameras, 1996.
[16] J. Heikkila and O. Silven. A four-step camera calibration procedure with implicit image
correction, 1997.
[17] R. Holloway. Registration error analysis for augmented reality, 1997.
[18] R. Horaud, B. Conio, and O. Leboulleux. An analytic solution for the perspective
4-point problem, 1989.
[19] B. Horn. Closed-form solution of absolute orientation using unit quaternions, 1986.
[20] B. Horn, H. Hilden, and S. Negahdaripour. Closed-form solution of absolute orientation
using orthonormal matrices, 1988.
[21] J. LaViola Jr. A comparison of unscented and extended Kalman filtering for estimating quaternion motion, 2003.
[22] S. Julier and J. Uhlmann. A new extension of the Kalman filter to nonlinear systems,
1997.
[23] J. Lasserre. Global optimization with polynomials and the problem of moments, 2001.
[24] P. Lee and J. Moore. Gauss-Newton-on-manifold for pose estimation, 2005.
[25] C. Lu, G. Hager, and E. Mjolsness. Fast and globally convergent pose estimation from
video images, 2000.
[26] J. Maida, C. Bowen, and J. Pace. Improving robotic operator performance using augmented reality, 2007.
[27] F. Moreno-Noguer, P. Fua, and V. Lepetit. Accurate non-iterative O(n) solution to the
PnP problem, 2007.
[28] U. Neumann and A. Majoros. Cognitive, performance, and systems issues for augmented
reality applications in manufacturing and maintenance, 1998.
[29] K. Nguyen. Inertial data fusion using Kalman filter methods for augmented reality, 1998.
[30] E. Pervin and J. Webb. Quaternions in computer vision and robotics, 1982.
[31] L. Quan and Z. Lan. Linear n-point camera pose determination, 1999.
[32] H. Regenbrecht and T. Schubert. Measuring presence in augmented reality environments: Design and a first test of a questionnaire, 2002.
[33] K. Shoemake. Quaternions, 2007.
[34] K. Stanney, R. Mourant, and R. Kennedy. Human factors issues in virtual environments:
A review of the literature, 1998.
[35] S. Thrun, W. Burgard, and D. Fox. Probabilistic robotics, 2001.
[36] N. Trawny and S. Roumeliotis. Indirect Kalman filter for 3D attitude estimation, 2005.
[37] B. Triggs. Camera pose and calibration from 4 or 5 known 3D points, 1999.
[38] R. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision
metrology using off-the-shelf TV cameras and lenses, 1987.
[39] A. Ude. Filtering in a unit quaternion space for model-based object tracking, 1999.
[40] S. Umeyama. Least-squares estimation of transformation parameters between two point
patterns, 1991.
[41] T. Vincent and W. Hoff. Analysis of head pose accuracy in augmented reality, 2000.
[42] W. Wolfe, D. Mathis, C. Sklair, and M. Magee. The perspective view of three points,
1991.
[43] R. Zanetti and R. Bishop. Quaternion estimation and norm-constrained Kalman filtering,
2006.
Chapter 9
Appendix

9.1 DLT Parameters

D = -(r_{31} X_0 + r_{32} Y_0 + r_{33} Z_0)                                                     (9.1)

l_1 = \frac{u_0 r_{31} - f_u r_{11}}{D}, \quad l_2 = \frac{u_0 r_{32} - f_u r_{12}}{D}, \quad l_3 = \frac{u_0 r_{33} - f_u r_{13}}{D}        (9.2)

l_4 = \frac{(f_u r_{11} - u_0 r_{31}) X_0 + (f_u r_{12} - u_0 r_{32}) Y_0 + (f_u r_{13} - u_0 r_{33}) Z_0}{D}                               (9.3)

l_5 = \frac{v_0 r_{31} - f_v r_{21}}{D}, \quad l_6 = \frac{v_0 r_{32} - f_v r_{22}}{D}, \quad l_7 = \frac{v_0 r_{33} - f_v r_{23}}{D}        (9.4)

l_8 = \frac{(f_v r_{21} - v_0 r_{31}) X_0 + (f_v r_{22} - v_0 r_{32}) Y_0 + (f_v r_{23} - v_0 r_{33}) Z_0}{D}                               (9.5)

l_9 = \frac{r_{31}}{D}, \quad l_{10} = \frac{r_{32}}{D}, \quad l_{11} = \frac{r_{33}}{D}                                                     (9.6)

l_1 X_0 + l_2 Y_0 + l_3 Z_0 = -l_4
l_5 X_0 + l_6 Y_0 + l_7 Z_0 = -l_8
l_9 X_0 + l_{10} Y_0 + l_{11} Z_0 = -1                                                          (9.7)
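Equation (9.7) is a 3x3 linear system in the projection centre (X_0, Y_0, Z_0), so the centre can be recovered directly from an estimated DLT vector. A minimal Matlab sketch, assuming the eleven parameters are stored in a vector l ordered l_1 through l_11:

% Recover the projection centre from an estimated DLT parameter vector
% by solving the linear system in (9.7). The vector name and ordering
% (l = [l1 ... l11]') are assumptions for this sketch.
A = [ l(1)  l(2)  l(3);
      l(5)  l(6)  l(7);
      l(9)  l(10) l(11) ];
b = [ -l(4); -l(8); -1 ];
C0 = A \ b;      % C0 = [X0; Y0; Z0], the camera centre in object coordinates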
Intrinsic Parameters

D^2 = \frac{1}{l_9^2 + l_{10}^2 + l_{11}^2}                                                     (9.8)

u_0 = D^2 (l_1 l_9 + l_2 l_{10} + l_3 l_{11}), \quad v_0 = D^2 (l_5 l_9 + l_6 l_{10} + l_7 l_{11})   (9.9)

f_u^2 = D^2 \left[ (u_0 l_9 - l_1)^2 + (u_0 l_{10} - l_2)^2 + (u_0 l_{11} - l_3)^2 \right]
f_v^2 = D^2 \left[ (v_0 l_9 - l_5)^2 + (v_0 l_{10} - l_6)^2 + (v_0 l_{11} - l_7)^2 \right]      (9.10)
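The intrinsic parameters in (9.8)-(9.10) translate directly into code. A minimal Matlab sketch, again assuming the DLT parameters are stored in l = [l1 ... l11]':

% Intrinsic parameters recovered from the DLT vector, following (9.8)-(9.10).
D2 = 1 / (l(9)^2 + l(10)^2 + l(11)^2);                                 % (9.8)
u0 = D2 * (l(1)*l(9) + l(2)*l(10) + l(3)*l(11));                       % (9.9)
v0 = D2 * (l(5)*l(9) + l(6)*l(10) + l(7)*l(11));
fu = sqrt(D2 * ((u0*l(9) - l(1))^2 + (u0*l(10) - l(2))^2 + (u0*l(11) - l(3))^2));   % (9.10)
fv = sqrt(D2 * ((v0*l(9) - l(5))^2 + (v0*l(10) - l(6))^2 + (v0*l(11) - l(7))^2));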
9.2 Simplex SDP

9.2.1 Three Variable/Vector Case

Distance Constraint

{}^{C}d_l = \left[ (\beta_a v_{ai} + \beta_b v_{bi} + \beta_c v_{ci}) - (\beta_a v_{aj} + \beta_b v_{bj} + \beta_c v_{cj}) \right]^2        (9.11)

Dot Product Constraint

{}^{C}d_l = \left[ (\beta_a v_{ai} + \beta_b v_{bi} + \beta_c v_{ci}) - (\beta_a v_{ak} + \beta_b v_{bk} + \beta_c v_{ck}) \right] \cdot
            \left[ (\beta_a v_{aj} + \beta_b v_{bj} + \beta_c v_{cj}) - (\beta_a v_{ak} + \beta_b v_{bk} + \beta_c v_{ck}) \right]          (9.12)

Objective Function

E_l = \left( C_{200}^{l}\beta_a^2 + C_{020}^{l}\beta_b^2 + C_{002}^{l}\beta_c^2 + C_{110}^{l}\beta_a\beta_b + C_{101}^{l}\beta_a\beta_c + C_{011}^{l}\beta_b\beta_c - {}^{M}d_l \right)^2   (9.13)

LMI Constraints

X_1 = \begin{bmatrix} \beta_a & \beta_b & \beta_c \end{bmatrix}^T                               (9.14)

X_2 = \begin{bmatrix} 1 & \beta_a^2 & \beta_b^2 & \beta_c^2 & \beta_a\beta_b & \beta_a\beta_c & \beta_b\beta_c \end{bmatrix}^T              (9.15)

M_0 = \begin{bmatrix} \beta_a^2-\beta_b^2 & 0 \\ 0 & \beta_a^2-\beta_c^2 \end{bmatrix}, \quad
M_1 = \begin{bmatrix} \beta_a^2 & \beta_a\beta_b & \beta_a\beta_c \\ \beta_a\beta_b & \beta_b^2 & \beta_b\beta_c \\ \beta_a\beta_c & \beta_b\beta_c & \beta_c^2 \end{bmatrix}

M_2 = \begin{bmatrix}
1 & \beta_a^2 & \beta_b^2 & \beta_c^2 & \beta_a\beta_b & \beta_a\beta_c & \beta_b\beta_c \\
\beta_a^2 & \beta_a^4 & \beta_a^2\beta_b^2 & \beta_a^2\beta_c^2 & \beta_a^3\beta_b & \beta_a^3\beta_c & \beta_a^2\beta_b\beta_c \\
\beta_b^2 & \beta_a^2\beta_b^2 & \beta_b^4 & \beta_b^2\beta_c^2 & \beta_a\beta_b^3 & \beta_a\beta_b^2\beta_c & \beta_b^3\beta_c \\
\beta_c^2 & \beta_a^2\beta_c^2 & \beta_b^2\beta_c^2 & \beta_c^4 & \beta_a\beta_b\beta_c^2 & \beta_a\beta_c^3 & \beta_b\beta_c^3 \\
\beta_a\beta_b & \beta_a^3\beta_b & \beta_a\beta_b^3 & \beta_a\beta_b\beta_c^2 & \beta_a^2\beta_b^2 & \beta_a^2\beta_b\beta_c & \beta_a\beta_b^2\beta_c \\
\beta_a\beta_c & \beta_a^3\beta_c & \beta_a\beta_b^2\beta_c & \beta_a\beta_c^3 & \beta_a^2\beta_b\beta_c & \beta_a^2\beta_c^2 & \beta_a\beta_b\beta_c^2 \\
\beta_b\beta_c & \beta_a^2\beta_b\beta_c & \beta_b^3\beta_c & \beta_b\beta_c^3 & \beta_a\beta_b^2\beta_c & \beta_a\beta_b\beta_c^2 & \beta_b^2\beta_c^2
\end{bmatrix}                                                                                   (9.16)
9.2.2 Four Variable/Vector Case

Distance Constraint

{}^{C}d_l = \left[ (\beta_a v_{ai} + \beta_b v_{bi} + \beta_c v_{ci} + \beta_d v_{di}) - (\beta_a v_{aj} + \beta_b v_{bj} + \beta_c v_{cj} + \beta_d v_{dj}) \right]^2        (9.17)

Dot Product Constraint

{}^{C}d_l = \left[ (\beta_a v_{ai} + \beta_b v_{bi} + \beta_c v_{ci} + \beta_d v_{di}) - (\beta_a v_{ak} + \beta_b v_{bk} + \beta_c v_{ck} + \beta_d v_{dk}) \right] \cdot
            \left[ (\beta_a v_{aj} + \beta_b v_{bj} + \beta_c v_{cj} + \beta_d v_{dj}) - (\beta_a v_{ak} + \beta_b v_{bk} + \beta_c v_{ck} + \beta_d v_{dk}) \right]          (9.18)

Objective Function

E_l = \left( C_{2000}^{l}\beta_a^2 + C_{0200}^{l}\beta_b^2 + C_{0020}^{l}\beta_c^2 + C_{0002}^{l}\beta_d^2
      + C_{1100}^{l}\beta_a\beta_b + C_{1010}^{l}\beta_a\beta_c + C_{1001}^{l}\beta_a\beta_d + C_{0110}^{l}\beta_b\beta_c + C_{0101}^{l}\beta_b\beta_d + C_{0011}^{l}\beta_c\beta_d - {}^{M}d_l \right)^2   (9.19)

LMI Constraints. Matrix M_2 is too large to fit on the page!

X_1 = \begin{bmatrix} \beta_a & \beta_b & \beta_c & \beta_d \end{bmatrix}^T                     (9.20)

X_2 = \begin{bmatrix} 1 & \beta_a^2 & \beta_b^2 & \beta_c^2 & \beta_d^2 & \beta_a\beta_b & \beta_a\beta_c & \beta_a\beta_d & \beta_b\beta_c & \beta_b\beta_d & \beta_c\beta_d \end{bmatrix}^T   (9.21)

M_0 = \begin{bmatrix} \beta_a^2-\beta_b^2 & 0 & 0 \\ 0 & \beta_a^2-\beta_c^2 & 0 \\ 0 & 0 & \beta_a^2-\beta_d^2 \end{bmatrix}, \quad
M_1 = \begin{bmatrix} \beta_a^2 & \beta_a\beta_b & \beta_a\beta_c & \beta_a\beta_d \\ \beta_a\beta_b & \beta_b^2 & \beta_b\beta_c & \beta_b\beta_d \\ \beta_a\beta_c & \beta_b\beta_c & \beta_c^2 & \beta_c\beta_d \\ \beta_a\beta_d & \beta_b\beta_d & \beta_c\beta_d & \beta_d^2 \end{bmatrix}   (9.22)
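The structure of these vectors and LMI matrices is easy to generate numerically. A minimal Matlab sketch for the four-variable case of (9.20)-(9.22), using hypothetical values for the β's; note that M_1 is simply the outer product of X_1 with itself:

% Numerical construction of the vectors and matrices in (9.20)-(9.22)
% for one candidate solution. The beta values are hypothetical and serve
% only to illustrate the structure.
ba = 0.4; bb = 0.3; bc = 0.2; bd = 0.1;
X1 = [ba; bb; bc; bd];                                             % (9.20)
X2 = [1; ba^2; bb^2; bc^2; bd^2; ...
      ba*bb; ba*bc; ba*bd; bb*bc; bb*bd; bc*bd];                   % (9.21)
M0 = diag([ba^2 - bb^2, ba^2 - bc^2, ba^2 - bd^2]);
M1 = X1 * X1';   % reproduces the 4x4 matrix of quadratic monomials in (9.22)

The three-variable case of (9.14)-(9.15) follows the same pattern with β_d removed.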
9.3 Chi-Square Distribution

\chi_k^2 = \sum_{i=1}^{k} \frac{(\hat{x} - x_i)^2}{\sigma_x^2}                                  (9.23)

The PDF and CDF of the chi-square distribution:

f_\chi(x, k) = \begin{cases} \dfrac{x^{k/2-1} e^{-x/2}}{2^{k/2}\,\Gamma(k/2)}, & x > 0 \\ 0, & x \le 0 \end{cases}   (9.24)

F_\chi(x, k) = \frac{\gamma(k/2,\, x/2)}{\Gamma(k/2)}                                           (9.25)

The gamma, upper incomplete gamma, and lower incomplete gamma functions:

\Gamma(a) = \int_0^\infty t^{a-1} e^{-t}\, dt                                                   (9.26)

\Gamma(a, x) = \int_x^\infty t^{a-1} e^{-t}\, dt                                                (9.27)

\gamma(a, x) = \int_0^x t^{a-1} e^{-t}\, dt                                                     (9.28)
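As a quick check of (9.24)-(9.25), the density and distribution function can be evaluated in a few lines of Matlab; gammainc returns the regularized lower incomplete gamma ratio γ(a, x)/Γ(a), so no hand-coded integral is needed. The degrees of freedom and evaluation grid below are arbitrary example values.

% Chi-square PDF and CDF following (9.24) and (9.25).
k = 3;                          % degrees of freedom (arbitrary example value)
x = linspace(0.01, 15, 200);    % evaluation grid, x > 0
f = x.^(k/2 - 1) .* exp(-x/2) ./ (2^(k/2) * gamma(k/2));   % PDF, eq. (9.24)
F = gammainc(x/2, k/2);         % CDF, eq. (9.25): gamma(k/2, x/2) / Gamma(k/2)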
9.4 Computer Vision Algorithm Comparison

Figure 9.1: Group 1, 10 Points, Top: σuv = 5, Bottom: σuv = 10
Figure 9.2: Group 1, 25 Points, Top: σuv = 5, Bottom: σuv = 10
Figure 9.3: Group 2, 10 Points, Top: σuv = 5, Bottom: σuv = 10
Figure 9.4: Group 2, 25 Points, Top: σuv = 5, Bottom: σuv = 10
Figure 9.5: Group 3, 10 Points, Top: σuv = 5, Bottom: σuv = 10
Figure 9.6: Group 3, 25 Points, Top: σuv = 5, Bottom: σuv = 10
Figure 9.7: Group 4, 10 Points, Top: σuv = 5, Bottom: σuv = 10
Figure 9.8: Group 4, 25 Points, Top: σuv = 5, Bottom: σuv = 10
Figure 9.9: Group 5, 10 Points, Top: σuv = 25, Middle: σuv = 50, Bottom: σuv = 75