A NEW VISION & INERTIAL POSE ESTIMATION SYSTEM FOR HANDHELD AUGMENTED REALITY: DESIGN, IMPLEMENTATION & TESTING

John M. Steinbis
Department of Engineering
Colorado School of Mines
March 19, 2008

Abstract

The research contained within this thesis may be summarized as the design, implementation, and testing of a new vision and inertial pose estimation system for registration within handheld augmented reality (AR). Additionally, a real-time AR testbed was constructed from off-the-shelf components and free or open-source software libraries. It runs at frame rate, 30 Hz, and maintains highly accurate dynamic overlay. Significant contributions were made in pose estimation and system architecture.

AR is the combination of computer-generated stimuli with the real world, with the aim of augmenting human experience or perception. Most often these stimuli consist only of visual data, but audio and haptics may also be used. Three different 3D models demonstrate AR capabilities within four application areas: a C-130 Hercules for training, a Kuka manipulator for education and teleoperation support, and a Matlab peaks plot for complex data visualization.

Traffic cones are utilized as 3D fiducials within a potentially large work volume, whereas planar patterns unnecessarily constrain user motion. The system is scalable because any size cone can be used, from 4" to 36". The cones are also inexpensive, durable, stackable, and may be quickly placed for outdoor use.

A quaternion-based extended Kalman filter (EKF) was implemented to filter sensor measurements and avoid the singularities found in other rotation parameterizations. A new object space pose uncertainty algorithm with a quaternion-normality constraint was designed and verified with a Monte Carlo method. This uncertainty calculation is needed for systems that rely on a quaternion-based estimator and object space algorithms such as the Orthogonal Iteration Algorithm (OIA).

Several new computer vision algorithms were created and tested in Matlab. The algorithms are compared against popular existing algorithms such as the DLT and OIA. Results influenced the selection of a specific algorithm combination within our pose estimation system.

A 6-DOF overlay test platform was created for measuring and analyzing overlay error in the end-to-end system. Virtual and real features are segmented from a video stream to measure alignment error. Students completed a "realism" questionnaire, with and without the gyros enabled. Responses indicated the need for the IMU at times when fiducials are not visible.

Contents

List of Figures
Chapter 1  Introduction
  1.1 Augmented Reality
  1.2 Registration
  1.3 Pose Estimation
  1.4 Thesis Contributions
  1.5 Thesis Overview
Chapter 2  Rigid Body Pose & Kinematics
  2.1 Basic Concepts
  2.2 Rotation Parameterization
    2.2.1 Euler Angles
    2.2.2 Angle-Axis
    2.2.3 Quaternions
  2.3 Kinematics
    2.3.1 Position
    2.3.2 Orientation
      2.3.2.1 Angular Velocity
      2.3.2.2 Euler-Angles
      2.3.2.3 Quaternion
Chapter 3  Sensors & Data Processing
  3.1 Camera
    3.1.1 Pinhole Camera Model
    3.1.2 Image Point Correction
    3.1.3 Traffic Cones as Fiducials
    3.1.4 Model Point Correction
    3.1.5 Cone Segmentation
      3.1.5.1 Color Classification
      3.1.5.2 Connected Components
      3.1.5.3 Blob Analysis
      3.1.5.4 Template Matching
    3.1.6 Correspondence
    3.1.7 Pose Estimation
  3.2 Inertial Measurement Unit
    3.2.1 Static Bias Correction
    3.2.2 Calibration of Relative Sensor Orientation
Chapter 4  Single Camera, Single Frame Pose
  4.1 Types of Algorithms
    4.1.1 Image Space
    4.1.2 Object Space
  4.2 Classic Algorithms
    4.2.1 Perspective-3-Point - Fischler & Bolles
    4.2.2 Direct Linear Transform - Abdel-Aziz & Karara
    4.2.3 "Absolute Orientation" - Horn
      4.2.3.1 Optimal Rotation - SVD
      4.2.3.2 Optimal Rotation - Quaternion
    4.2.4 Orthogonal Iteration Algorithm - Lu & Hager
  4.3 Experimental Algorithms
    4.3.1 Modified DLT + Gauss-Newton
      4.3.1.1 Linear Transformation
      4.3.1.2 Image Space System
      4.3.1.3 Object Space System
      4.3.1.4 Null Space Solution
      4.3.1.5 Final Optimization
    4.3.2 3-Simplex + Gauss-Newton
      4.3.2.1 3-Simplex and Barycentric Coordinates
      4.3.2.2 Measurement Matrices
      4.3.2.3 Null Space Solution
      4.3.2.4 Final Optimization
    4.3.3 3-Simplex SDP
      4.3.3.1 Distance Constraint
      4.3.3.2 "Linearization" + Gauss-Newton
      4.3.3.3 Dot Product Constraint
      4.3.3.4 Orthogonal Triad Constraints
      4.3.3.5 Polynomial Semidefinite Program
Chapter 5  Recursive Estimation
  5.1 Recursive Estimation of Markov Processes
  5.2 Gaussian & Nonparametric Filters
  5.3 Selection of Quaternion EKF
  5.4 Time Update
    5.4.1 State Vector
    5.4.2 State Prediction
    5.4.3 Uncertainty Prediction
  5.5 Measurement Update
    5.5.1 Camera Update
    5.5.2 Gyro Update
    5.5.3 Accelerometer Update
      5.5.3.1 Orientation
      5.5.3.2 Position and Orientation
    5.5.4 Quaternion Constraint
  5.6 Camera Uncertainty
    5.6.1 Derivation
    5.6.2 Image Space
    5.6.3 Object Space
    5.6.4 Quaternion Constraint
Chapter 6  Real-time System
  6.1 Hardware Description
  6.2 Software Description
  6.3 Measurement Buffer
  6.4 Vision Thread
    6.4.1 Image Space Error Residual Constraint
    6.4.2 Relative Change in Motion Constraint
  6.5 Graphics Generation
Chapter 7  Experiments
  7.1 Computer Vision Algorithms
    7.1.1 Overall Algorithm Comparison
    7.1.2 Best Algorithm Combinations
    7.1.3 Camera Uncertainty
  7.2 Overlay Test Platform
  7.3 Human Testing - Realism
Chapter 8  Summary and Conclusions
Bibliography
Chapter 9  Appendix
  9.1 DLT Parameters
  9.2 Simplex SDP
    9.2.1 Three Variable/Vector Case
    9.2.2 Four Variable/Vector Case
  9.3 Chi-Square Distribution
  9.4 Computer Vision Algorithm Comparison

List of Figures

3.1 Perspective Projection
3.2 Cone Geometry
3.3 Model Point Correction
3.4 Cone Segmentation
3.5 Cone Blob
3.6 RGB Color Space
3.7 HSI Color Space
3.8 8-Connectivity
3.9 Pyramid Decomposition
3.10 Cone with Ellipse
3.11 Cone Templates
3.12 IMU Coordinate System
4.1 Error Spaces
4.2 Perspective Tetrahedron
4.3 Congruent Triangle Alignment
5.1 Markov Process
5.2 EKF Prediction and Correction Cycle
5.3 Hyperplane Tangent to Hypersphere
6.1 Hardware Components & Sensors
6.2 Sensors & LCD
6.3 Libraries / Dependencies
6.4 Full System
6.5 Threads
6.6 Measurement Buffer
6.7 Measurement Node
6.8 Vision Thread
6.9 Graphics Generation
6.10 C-130 Hercules
6.11 Kuka KR6
6.12 Peaks Plot
7.1 Computer Vision Algorithm Test Groups
7.2 "Best Algorithm Combinations"
7.3 Camera Uncertainty, Image Space, Euler Angles
7.4 Camera Uncertainty, Image Space, Quaternion
7.5 Camera Uncertainty, Object Space, Euler Angles
7.6 Camera Uncertainty, Object Space, Quaternion
7.7 Overlay Test Platform
7.8 Overlay Accuracy, without Gyros (Inches)
7.9 Overlay Accuracy, with Gyros (Inches)
7.10 Realism Questionnaire
9.1 Group 1, 10 Points, Top: σuv = 5, Bottom: σuv = 10
9.2 Group 1, 25 Points, Top: σuv = 5, Bottom: σuv = 10
9.3 Group 2, 10 Points, Top: σuv = 5, Bottom: σuv = 10
9.4 Group 2, 25 Points, Top: σuv = 5, Bottom: σuv = 10
9.5 Group 3, 10 Points, Top: σuv = 5, Bottom: σuv = 10
9.6 Group 3, 25 Points, Top: σuv = 5, Bottom: σuv = 10
9.7 Group 4, 10 Points, Top: σuv = 5, Bottom: σuv = 10
9.8 Group 4, 25 Points, Top: σuv = 5, Bottom: σuv = 10
9.9 Group 5, 10 Points, Top: σuv = 25, Middle: σuv = 50, Bottom: σuv = 75

Chapter 1
Introduction

1.1 Augmented Reality

Augmented Reality (AR) is the combination of computer-generated stimuli with the real world, with the aim of augmenting human experience or perception. It is similar to Virtual Reality (VR); however, in VR the user is completely immersed in a synthetic environment, whereas in AR the user still has the ability to see and interact with the real world. Therefore, AR supplements reality rather than completely replacing it. Ideally, the user perceives the computer-generated stimuli as coexisting in the same environment. Most often these stimuli consist only of visual data, but audio and haptics may also be used.

The field is currently receiving much attention from researchers in both academia and industry. Most of the recent interest results from extremely powerful hardware at decreasing cost, which allows capable and affordable systems to be constructed from off-the-shelf components. Azuma's surveys [4], [5] are good resources for general background information on augmented reality.

The primary motivation behind AR is that it can provide information to the user that cannot be detected directly with the user's own senses. This concept may be applied within an unlimited number of domains. Successful implementations have existed for a few years now [26], but significant improvements are still needed before the technology is widely accepted. Complicated tasks, such as assembly, maintenance, and surgery, may be supported by the insertion of additional information and the visualization of hidden objects. Instructions and maps may be presented to users during navigation within buildings, outdoors, or while driving. Visualization of architecture, historic buildings, planned construction projects, and other complex data sets is much more meaningful to a person if viewed three-dimensionally as part of an actual scene. To avoid the high risk or cost of training, flight and driving simulators are another attractive application area. In robotics, sensitive tasks, such as path planning, may be supported by augmented teleoperation or telepresence. Finally, entertainment and education may benefit from virtual objects in the classroom, museum exhibitions, theme park attractions, and interactive games.

For visual augmentation, AR requires a device to present data to the user. Head-mounted displays (HMDs) are most common, but a monitor/LCD may also be used. Basically, two different types of technologies exist: optical and video see-through.
Display options are covered in more detail within chapter six.

1.2 Registration

The major challenge for AR systems is how to combine the real and virtual worlds into a single environment while maintaining the illusion that virtual objects are indeed part of the real world. Consistent registration is required or the illusion may be compromised. More seriously, many applications demand a high degree of accuracy. Specifically, the relationship of the user and objects within the world must be known for objects to be aligned correctly. Typically, the relationship of objects to the world is static while the user's relationship to the world is dynamic. It is natural for the user to move to different locations and view objects from different angles. Obviously, an AR system cannot control the motions of the user. The user can look or move anywhere and the system must respond almost immediately.

Registration errors result from numerous sources and can be divided into two types, static and dynamic. Static registration errors include both misalignment and jitter, or shaking. Misalignment errors are usually caused by poor calibration of sensors and optics. Jitter is caused by noise in highly sensitive sensors. With dynamic error, virtual objects appear to swim and lag, thereby becoming disconnected from real objects. The primary causes of this error are system delays, or lags, from sources such as sensor processing and graphics generation. The total (end-to-end) system delay can be measured as the time from when the tracking system takes a measurement to when computer-generated data is presented to the user. System delays cause registration errors only when motion occurs. Tracking accuracy is another cause of dynamic error; for most systems it is more difficult to accurately determine pose when fast motion occurs.

Many perceptual issues arise as a result of registration error. The mind compensates for small misalignment, but at some point the system begins to degrade. In general, error during quick user motion is less detectable. A threshold of tolerable registration error has yet to be well established. Refer to [34], [28], [17], and [10] for a more detailed discussion.

1.3 Pose Estimation

The combination of position and orientation is referred to as pose. Real-time pose estimation and graphics generation are a necessity to create a realistic augmented environment. However, pose estimation is in no way limited to AR applications; navigation and robotics are two classic examples of fields that also depend on accurate pose estimation. One of the oldest forms of navigation relies on the sun and other stars. This was especially important for sailors in the middle of the ocean where no landmarks exist. Celestial navigation is now used on spacecraft by an optical system referred to as a "star tracker". On the other hand, for a robot to successfully manipulate or avoid collision with an object, it must know the relationship of the object with respect to itself and the world.

A variety of sensors may be used for pose estimation. Cameras are a passive optical sensor because they only receive light. Active optical sensors, such as LIDAR, emit light. Other sensors include acoustic, magnetic, inertial, GPS, and wheel encoders. Of course, they all have strengths and weaknesses, so multiple sensors are often used to complement each other. Computer vision techniques applied to camera images are probably the most popular method of close-range pose estimation.
Given that the sensor is calibrated, the mapping from 3D points in the scene to 2D points in the image is well known. Therefore, the projected image of an object (of known geometry) is a function of the object's pose only. Once a set of control points on the object (typically corners or other feature points) has been identified, it is then possible to solve for the pose from a set of equations which relate the 3D coordinates to their 2D image coordinates. Naturally occurring features may be used as control points, but they are typically difficult to detect reliably. Artificial features are often intentionally introduced because they are easier to detect or because unique natural features simply do not exist. A few examples include planar patterns, concentric dots, and LEDs.

1.4 Thesis Contributions

The research project initially started as part of a military training program for C-130 loadmaster students. The military's goals were to save money by reducing the time spent in an actual aircraft and to be able to simulate dangerous scenarios without the risks of trainee health and equipment damage. Original design concepts of an outdoor tracking system were handed over to the Air Force, and since then the project has grown into something much different.

An actual AR system was constructed at the Colorado School of Mines specifically for this research project. It was created for algorithm development and to facilitate future research in the field. The system runs at frame rate, 30 Hz, and maintains highly accurate dynamic overlay. Significant contributions were made in pose estimation and system architecture. Specifically,

• Development of a real-time augmented reality system utilizing 3D cone fiducials.
• Development of new pose estimation algorithms, and analysis against the state of the art.
• Development of a new quaternion-pose uncertainty calculation for object space computer vision algorithms, and verification with a Monte Carlo method.
• Experimental analysis of static and dynamic registration performance with different tracking strategies, including both automated numeric and human testing results.

1.5 Thesis Overview

The research contained within this thesis may be summarized as the design, implementation, and testing of a new vision and inertial pose estimation system for registration with a low-cost handheld augmented reality device. Background information on rigid body pose, rotation parameterization, and kinematics is covered in chapter two. This information is essential for understanding the goal of a pose estimation system and serves as a foundation for advanced concepts in later chapters. Chapters 3-6 cover the design and implementation of the system. These chapters are very math intensive and attempt to fully explain how a camera and inertial sensor are used to estimate the 6-DOF pose of an object for the purpose of registration. In chapter seven, various tests of the system are presented. Matlab is used to test several computer vision algorithms to see how well they perform under various conditions, such as the number of feature points and the amount of noise. A new computer vision algorithm is tested against existing algorithms as a comparison. Monte Carlo simulations are used to verify a new object space uncertainty calculation. An overlay test is used to determine the quality of overlay in the end-to-end system. Finally, results of an augmented reality questionnaire are provided. The questionnaire was completed by students after being given the opportunity to use the system firsthand with several different virtual objects.
Chapter 2
Rigid Body Pose & Kinematics

This chapter serves as a foundation for much of the thesis. First, rigid body pose (position and orientation) is covered extensively and several rotation parameterizations are discussed. Then, rigid body kinematics is covered, specifically angular velocity. Some of the material might be a review for those who are already familiar with spatial transformations, but it still serves as a convenient reference for advanced concepts developed in later chapters. Many basic concepts come from Craig [8], and advanced concepts, such as quaternions, come from various other sources including [11], [9], [6], [33], [30].

2.1 Basic Concepts

A rigid body is an idealization of a solid body of finite size in which deformation is neglected. In other words, the distance between any two points on a rigid body remains constant in time. The pose of a rigid body can be described by a combination of translation and rotation from a given reference frame. For this purpose, a reference frame is selected to be rigidly attached to the body itself. This is typically referred to as the "object" reference frame {O}. The position of its origin and the orientation of its axes with respect to a given "global" or "world" reference frame {W} represent the pose of the body.

A reference frame is defined as a right-handed orthogonal triad of unit vectors, $\hat{X}$, $\hat{Y}$, $\hat{Z}$, that share a common point of intersection at the origin. The triad forms a basis for the coordinate system. The orientation of one coordinate system, {B}, with respect to another, {A}, may be described by the direction cosine matrix (DCM),
\[
{}^A_B R =
\begin{bmatrix}
\hat{X}_B\cdot\hat{X}_A & \hat{Y}_B\cdot\hat{X}_A & \hat{Z}_B\cdot\hat{X}_A \\
\hat{X}_B\cdot\hat{Y}_A & \hat{Y}_B\cdot\hat{Y}_A & \hat{Z}_B\cdot\hat{Y}_A \\
\hat{X}_B\cdot\hat{Z}_A & \hat{Y}_B\cdot\hat{Z}_A & \hat{Z}_B\cdot\hat{Z}_A
\end{bmatrix},
\tag{2.1}
\]
often referred to as a rotation matrix. By close inspection of rows and columns we see that
\[
{}^A_B R =
\begin{bmatrix}
{}^A\hat{X}_B & {}^A\hat{Y}_B & {}^A\hat{Z}_B
\end{bmatrix}
=
\begin{bmatrix}
{}^B\hat{X}_A^T \\ {}^B\hat{Y}_A^T \\ {}^B\hat{Z}_A^T
\end{bmatrix}.
\tag{2.2}
\]
Clearly, the orientation of frame {A} relative to {B} is the transpose of the DCM above. Hence, the inverse of a DCM is equal to its transpose,
\[
{}^A_B R^T = {}^B_A R = {}^A_B R^{-1},
\tag{2.3}
\]
and therefore the matrix is orthogonal. Since $1 = \det(I) = \det(R^T R) = \det(R^T)\det(R) = \det(R)^2$, an orthonormal matrix has determinant equal to $\pm 1$, and thus the mapping defined by this matrix preserves vector length. Additionally, vector orientation is preserved or reversed depending on whether the determinant is positive or negative. A proper rotation matrix has determinant equal to $+1$ and belongs to the special orthogonal group, SO(3). An improper rotation matrix has determinant equal to $-1$ and does not belong to the rotation group.

Consecutive multiplication of rotation matrices often arises in many fields, such as manipulator kinematics, where multiple reference frames exist. Multiplication of rotation matrices is associative,
\[
{}^A_B R\,({}^B_C R\,{}^C_D R) = ({}^A_B R\,{}^B_C R)\,{}^C_D R = {}^A_B R\,{}^B_C R\,{}^C_D R,
\tag{2.4}
\]
but not commutative,
\[
{}^A_B R\,{}^B_C R \neq {}^B_C R\,{}^A_B R.
\tag{2.5}
\]

It is likely that the origins of two reference frames are not coincident. The displacement between origins is known as translation. In space, a three-element vector is used to describe the position of a reference frame's origin relative to another frame. For example, the translation of frame {B} relative to frame {A} is equivalent to the position of frame {B}'s origin within frame {A}, written as
\[
{}^A t_B = {}^A P_{B\,org}.
\tag{2.6}
\]
The homogeneous transformation matrix,
\[
H =
\begin{bmatrix}
R & t \\
0 & 1
\end{bmatrix}
=
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & t_x \\
r_{21} & r_{22} & r_{23} & t_y \\
r_{31} & r_{32} & r_{33} & t_z \\
0 & 0 & 0 & 1
\end{bmatrix},
\tag{2.7}
\]
is a convenient way to combine rotation and translation into a single matrix. An arbitrary point,
\[
{}^A P_i = \begin{bmatrix} {}^A X_i & {}^A Y_i & {}^A Z_i \end{bmatrix}^T,
\tag{2.8}
\]
of known location in one reference frame, {A}, may be expressed in another frame, {B}, by a rotation and translation,
\[
{}^B P_i = {}^B_A R\,{}^A P_i + {}^B t_A.
\tag{2.9}
\]
With the homogeneous transform, a point set of arbitrary size,
\[
{}^A P_* = \begin{bmatrix} {}^A P_1 & \dots & {}^A P_n \end{bmatrix},
\tag{2.10}
\]
can be transformed in a single matrix multiplication,
\[
\begin{bmatrix} {}^B P_* \\ 1 \end{bmatrix} = {}^B_A H \begin{bmatrix} {}^A P_* \\ 1 \end{bmatrix}.
\tag{2.11}
\]
This operation is more difficult to achieve when the rotation and translation are used separately.
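As a concrete illustration of equations (2.7) through (2.11), the short NumPy sketch below assembles a homogeneous transform from a rotation matrix and a translation vector and applies it to a whole point set in a single multiplication. The function names and example values are my own; this is not code from the thesis system.

```python
import numpy as np

def make_homogeneous(R, t):
    """Assemble the 4x4 homogeneous transform of equation (2.7)."""
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = t
    return H

def transform_points(H, P):
    """Apply equation (2.11): P is 3xn, returns the transformed 3xn point set."""
    n = P.shape[1]
    P_h = np.vstack([P, np.ones((1, n))])   # append the homogeneous row of ones
    return (H @ P_h)[:3, :]

# Example: rotate 90 degrees about Z and translate by (1, 0, 0).
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
t = np.array([1.0, 0.0, 0.0])
P = np.array([[0.0, 1.0],
              [0.0, 0.0],
              [0.0, 2.0]])                  # two points stored as columns
print(transform_points(make_homogeneous(Rz, t), P))
```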
2.2 Rotation Parameterization

According to Euler's rotation theorem, the general displacement of a rigid body (or coordinate frame) with one point fixed is described by a rotation about some axis. Furthermore, such a rotation may be uniquely described by a minimum of three parameters. However, for various reasons, there are several ways to represent such a rotation. Many of these representations use more than the necessary minimum of three parameters, although such representations still only have three degrees of freedom.

Cayley's transform for orthogonal matrices states that any orthogonal matrix, $O$, of any size, which does not have $-1$ as an eigenvalue, can be represented by a skew-symmetric matrix, $S = -S^T$, via
\[
O = (I - S)^{-1}(I + S).
\tag{2.12}
\]
Similarly, the skew-symmetric matrix may be computed as
\[
S = (O - I)(O + I)^{-1},
\tag{2.13}
\]
which illustrates that the nine elements of a rotation matrix are not independent and may be described by a minimum of three parameters,
\[
S_3 =
\begin{bmatrix}
0 & -s_z & s_y \\
s_z & 0 & -s_x \\
-s_y & s_x & 0
\end{bmatrix}.
\tag{2.14}
\]
However, the mapping from DCM to skew-symmetric matrix is unbounded for certain orthogonal matrices. Furthermore, six constraints may be placed on the nine elements of the DCM, leaving only three degrees of freedom. By letting $\hat{X}$, $\hat{Y}$, and $\hat{Z}$ equal the rows or columns of the rotation matrix, the constraints may be written as
\[
\|\hat{X}\| = \|\hat{Y}\| = \|\hat{Z}\| = 1,
\tag{2.15}
\]
\[
\hat{X}\cdot\hat{Y} = \hat{X}\cdot\hat{Z} = \hat{Y}\cdot\hat{Z} = 0.
\tag{2.16}
\]
Using the direction cosines (or DCM) requires a total of nine parameters. Commonly used parameter triples include Euler angles, Cardan/Bryant angles, and Rodrigues parameters. Four-parameter representations include the Euler angle-axis, Euler parameters, and Cayley-Klein parameters. Only the four-parameter representations behave well for arbitrary rotations because a nonsingular mapping to a rotation matrix requires a set of at least four parameters.

2.2.1 Euler Angles

Euler angles are the most commonly used parameter triplet and consist of three consecutive rotations about orthogonal axes, i.e.
\[
R_x(\gamma) =
\begin{bmatrix}
1 & 0 & 0 \\
0 & c(\gamma) & -s(\gamma) \\
0 & s(\gamma) & c(\gamma)
\end{bmatrix},
\quad
R_y(\beta) =
\begin{bmatrix}
c(\beta) & 0 & s(\beta) \\
0 & 1 & 0 \\
-s(\beta) & 0 & c(\beta)
\end{bmatrix},
\quad
R_z(\alpha) =
\begin{bmatrix}
c(\alpha) & -s(\alpha) & 0 \\
s(\alpha) & c(\alpha) & 0 \\
0 & 0 & 1
\end{bmatrix},
\tag{2.17}
\]
where $s(\theta) = \sin(\theta)$ and $c(\theta) = \cos(\theta)$. There are $\frac{4!}{(4-3)!} = 24$ possible rotation combinations, so it is very important to specify the convention used. Many problems arise from this ambiguity. For example, the XYZ and ZYX Euler angle conventions are
\[
R(\theta_{xyz}) = R_x(\gamma) R_y(\beta) R_z(\alpha)
\tag{2.18}
\]
and
\[
R(\theta_{zyx}) = R_z(\alpha) R_y(\beta) R_x(\gamma),
\tag{2.19}
\]
respectively. In many domains, such as manipulator kinematics, Euler angles provide a very intuitive parameterization of rotation, which is a likely cause of their popularity. However, they suffer from singularities in the mapping from DCM to rotation angles and are not a good parameterization choice for arbitrary rotations. For example, the two conventions shown above both have singularities at $|\beta| = \frac{\pi}{2}$. The situation when singularities are encountered is referred to as "gimbal lock"; effectively, a degree of freedom has been lost which the parameterization cannot handle.
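The convention ambiguity of equations (2.18) and (2.19) is easy to demonstrate numerically. The following sketch, which is illustrative only and not thesis code, composes the same angle triple in XYZ and ZYX order and confirms that the two conventions produce different rotation matrices.

```python
import numpy as np

def Rx(g):
    c, s = np.cos(g), np.sin(g)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def Ry(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

gamma, beta, alpha = 0.1, 0.2, 0.3            # angles about x, y, z
R_xyz = Rx(gamma) @ Ry(beta) @ Rz(alpha)      # equation (2.18)
R_zyx = Rz(alpha) @ Ry(beta) @ Rx(gamma)      # equation (2.19)
print(np.allclose(R_xyz, R_zyx))              # False: the conventions differ
```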
2.2.2 Angle-Axis

Decomposition of a rotation matrix, $R$, will yield eigenvalues of $\{1, e^{\pm i\theta}\}$, or equivalently $\{1, \cos(\theta) \pm i\sin(\theta)\}$. The product of the eigenvalues is equal to the determinant and thus equal to $+1$. The angle $\theta$ which appears in the eigenvalue expression corresponds to the magnitude of rotation in Euler's angle-axis representation. The eigenvector corresponding to the $+1$ eigenvalue is the accompanying Euler axis. A rotation matrix parameterized by the Euler angle-axis may be written as
\[
R =
\begin{bmatrix}
k_x^2 v_a + c_a & k_x k_y v_a - k_z s_a & k_x k_z v_a + k_y s_a \\
k_x k_y v_a + k_z s_a & k_y^2 v_a + c_a & k_y k_z v_a - k_x s_a \\
k_x k_z v_a - k_y s_a & k_y k_z v_a + k_x s_a & k_z^2 v_a + c_a
\end{bmatrix},
\tag{2.20}
\]
where $c_a = \cos(\theta)$, $s_a = \sin(\theta)$, and $v_a = 1 - \cos(\theta)$. The expression may be written in a more compact form that uses a skew-symmetric matrix similar to those found in Cayley's transform for orthogonal matrices, specifically
\[
R = I + \sin(\theta)\, S + (1 - \cos(\theta))\, S^2,
\tag{2.21}
\]
where
\[
S =
\begin{bmatrix}
0 & -\hat{k}_z & \hat{k}_y \\
\hat{k}_z & 0 & -\hat{k}_x \\
-\hat{k}_y & \hat{k}_x & 0
\end{bmatrix}.
\tag{2.22}
\]
This leads to a calculation of the Euler angle-axis that does not require eigenvalue decomposition of the rotation matrix. Some simple algebra yields the following result. First, the angle may be calculated from the trace of the matrix,
\[
\theta = \cos^{-1}\!\left(\frac{\mathrm{tr}(R) - 1}{2}\right), \qquad \theta \in [0, \pi].
\tag{2.23}
\]
When $\theta = 0$, any unit vector is a valid axis because there is no rotation. If $\theta \in (0, \pi)$, the axis may be extracted directly from
\[
S = \frac{R - R^T}{2\sin(\theta)},
\tag{2.24}
\]
so that
\[
k = \frac{1}{2\sin(\theta)}
\begin{bmatrix}
r_{32} - r_{23} \\
r_{13} - r_{31} \\
r_{21} - r_{12}
\end{bmatrix}.
\tag{2.25}
\]
However, when $\theta = \pi$, the axis must be constructed differently, since $R - R^T = 0$ and
\[
R = I_3 + 2S^2 =
\begin{bmatrix}
1 - 2(k_y^2 + k_z^2) & 2 k_x k_y & 2 k_x k_z \\
2 k_x k_y & 1 - 2(k_x^2 + k_z^2) & 2 k_y k_z \\
2 k_x k_z & 2 k_y k_z & 1 - 2(k_x^2 + k_y^2)
\end{bmatrix}.
\tag{2.26}
\]
The most stable way of calculating the axis is to extract the maximum component from the diagonal entries of the rotation matrix. From the diagonal terms we obtain
\[
k_x^2 = \frac{1 + r_{11} - r_{22} - r_{33}}{4}, \qquad
k_y^2 = \frac{1 - r_{11} + r_{22} - r_{33}}{4}, \qquad
k_z^2 = \frac{1 - r_{11} - r_{22} + r_{33}}{4},
\tag{2.27}
\]
and from the off-diagonal terms we obtain
\[
k_x k_y = \frac{r_{12}}{2} = \frac{r_{21}}{2}, \qquad
k_x k_z = \frac{r_{13}}{2} = \frac{r_{31}}{2}, \qquad
k_y k_z = \frac{r_{23}}{2} = \frac{r_{32}}{2}.
\tag{2.28}
\]
To calculate the axis, simply choose the equation from equation 2.27 that yields the largest value and then calculate the remaining two parameters from equation 2.28.

It is worth noting that the angle-axis representation is unique only up to a twofold ambiguity, because the opposite angle and axis represent the same rotation, i.e. $R(\hat{k}, \theta) = R(-\hat{k}, -\theta)$. Therefore, the inverse rotation may be found by negating either the axis or the angle. The axis is most often negated so that the angle remains within the domain specified earlier, $[0, \pi]$.
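A minimal sketch of the angle-axis extraction described by equations (2.23) through (2.28) is given below. It follows the max-diagonal strategy recommended in the text for the case near $\theta = \pi$; the tolerance value and function name are arbitrary choices, not part of the thesis implementation.

```python
import numpy as np

def angle_axis_from_R(R, eps=1e-9):
    """Extract (theta, k_hat) from a rotation matrix per equations (2.23)-(2.28)."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < eps:                      # no rotation: any unit axis is valid
        return 0.0, np.array([1.0, 0.0, 0.0])
    if theta < np.pi - eps:              # generic case, equation (2.25)
        k = np.array([R[2, 1] - R[1, 2],
                      R[0, 2] - R[2, 0],
                      R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
        return theta, k
    # theta near pi: use the largest diagonal term, equations (2.27)-(2.28)
    d = np.array([1 + R[0, 0] - R[1, 1] - R[2, 2],
                  1 - R[0, 0] + R[1, 1] - R[2, 2],
                  1 - R[0, 0] - R[1, 1] + R[2, 2]]) / 4.0
    i = int(np.argmax(d))
    k = np.zeros(3)
    k[i] = np.sqrt(d[i])
    j, l = [(1, 2), (0, 2), (0, 1)][i]
    k[j] = R[i, j] / (2.0 * k[i])        # off-diagonal products, equation (2.28)
    k[l] = R[i, l] / (2.0 * k[i])
    return theta, k
```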
2.2.3 Quaternions

Quaternions were developed by Hamilton in 1843 and are a non-commutative extension of complex numbers into four dimensions. Although they have been superseded in most applications by ordinary vector algebra, they still find use in both theoretical and applied mathematics, in particular in calculations involving three-dimensional spatial rotations. The set of quaternions is defined as
\[
Q = \{\, q_0 + q_x i + q_y j + q_z k \;|\; q_0, q_x, q_y, q_z \in \mathbb{R} \,\}.
\tag{2.29}
\]
The quaternion may be viewed as a linear combination of a scalar and a vector, $q = q_0 + \vec{q}$. As an ordinary vector, the quaternion will be written as
\[
q = \begin{bmatrix} q_0 & \vec{q} \end{bmatrix}^T = \begin{bmatrix} q_0 & q_x & q_y & q_z \end{bmatrix}^T.
\tag{2.30}
\]
A unit quaternion satisfies the normality constraint
\[
\|q\|^2 = q_0^2 + q_x^2 + q_y^2 + q_z^2 = 1,
\tag{2.31}
\]
and, like any other vector, a quaternion can be "normalized" to unit length by simply dividing by the norm, $\hat{q} = \frac{q}{\|q\|}$. The quaternion conjugate is essentially the same as the regular complex conjugate,
\[
q^* = \begin{bmatrix} q_0 & -\vec{q} \end{bmatrix}^T = \begin{bmatrix} q_0 & -q_x & -q_y & -q_z \end{bmatrix}^T,
\tag{2.32}
\]
and the quaternion inverse is the conjugate divided by the squared norm, i.e.
\[
q^{-1} = \frac{q^*}{\|q\|^2},
\tag{2.33}
\]
so that for a unit quaternion the inverse is simply the conjugate.

Euler-Rodrigues parameters are a unit quaternion. Rodrigues proposed the idea around the same time Hamilton invented quaternion theory. The quaternion represents a rotation in terms of the Euler angle-axis as
\[
q_0 = \cos\!\left(\frac{\theta}{2}\right), \qquad \vec{q} = \sin\!\left(\frac{\theta}{2}\right)\hat{k}.
\tag{2.34}
\]
Therefore, the identity quaternion of rotation has a $+1$ scalar and a zero vector,
\[
q_I = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}^T.
\tag{2.35}
\]
The singularity within the angle-axis representation occurs at zero rotation because the axis is undefined. As shown above, quaternions do not suffer from this problem; hence, they have a nonsingular mapping. A rotation matrix parameterized by the quaternion may be written as
\[
R =
\begin{bmatrix}
q_0^2 + q_x^2 - q_y^2 - q_z^2 & 2(q_x q_y - q_0 q_z) & 2(q_x q_z + q_0 q_y) \\
2(q_x q_y + q_0 q_z) & q_0^2 - q_x^2 + q_y^2 - q_z^2 & 2(q_y q_z - q_0 q_x) \\
2(q_x q_z - q_0 q_y) & 2(q_y q_z + q_0 q_x) & q_0^2 - q_x^2 - q_y^2 + q_z^2
\end{bmatrix}.
\tag{2.36}
\]
For unit quaternions the expression can be simplified along the diagonal, so that
\[
R =
\begin{bmatrix}
1 - 2(q_y^2 + q_z^2) & 2(q_x q_y - q_0 q_z) & 2(q_x q_z + q_0 q_y) \\
2(q_x q_y + q_0 q_z) & 1 - 2(q_x^2 + q_z^2) & 2(q_y q_z - q_0 q_x) \\
2(q_x q_z - q_0 q_y) & 2(q_y q_z + q_0 q_x) & 1 - 2(q_x^2 + q_y^2)
\end{bmatrix}.
\tag{2.37}
\]
This leads to a calculation of the rotation quaternion from the rotation matrix directly, instead of using the angle-axis. From the diagonal terms we obtain
\[
q_0^2 = \frac{1 + r_{11} + r_{22} + r_{33}}{4}, \quad
q_x^2 = \frac{1 + r_{11} - r_{22} - r_{33}}{4}, \quad
q_y^2 = \frac{1 - r_{11} + r_{22} - r_{33}}{4}, \quad
q_z^2 = \frac{1 - r_{11} - r_{22} + r_{33}}{4},
\tag{2.38}
\]
and from the off-diagonal terms we obtain
\[
q_0 q_x = \frac{r_{32} - r_{23}}{4}, \quad
q_0 q_y = \frac{r_{13} - r_{31}}{4}, \quad
q_0 q_z = \frac{r_{21} - r_{12}}{4}, \qquad
q_x q_y = \frac{r_{12} + r_{21}}{4}, \quad
q_x q_z = \frac{r_{13} + r_{31}}{4}, \quad
q_y q_z = \frac{r_{23} + r_{32}}{4}.
\tag{2.39}
\]
The most stable way of calculating the quaternion is to extract the maximum component from the diagonal terms of the rotation matrix. Choose the equation from 2.38 that yields the largest value and then calculate the remaining three parameters from equation 2.39. Like the angle-axis, the rotation quaternion is unique only up to sign, because the negated quaternion represents the same rotation, $R(q) = R(-q)$. In other words, two opposite poles on the hypersphere represent the same rotation.

The equation $i^2 = j^2 = k^2 = ijk = -1$ is the fundamental formula for multiplication of complex numbers in four dimensions, from which a set of quaternion multiplicative identities can be derived:
\[
ij = k, \quad ji = -k, \qquad
jk = i, \quad kj = -i, \qquad
ki = j, \quad ik = -j.
\tag{2.40}
\]
From the derivation, we can see that quaternion multiplication, $\otimes$, is associative (and distributive) but clearly not commutative.

A 3-dimensional vector, such as position, can be represented as a quaternion with zero scalar part,
\[
v = \begin{bmatrix} 0 \\ {}^B P_i \end{bmatrix}, \qquad
v' = \begin{bmatrix} 0 \\ {}^A P_i \end{bmatrix}.
\tag{2.41}
\]
The vector can be rotated directly with quaternion multiplication, without first converting back into a rotation matrix. Let $R(q) = {}^A_B R$; then
\[
v' = \hat{q} \otimes v \otimes \hat{q}^{-1}.
\tag{2.42}
\]
The same operation can be performed with ordinary matrix-vector multiplication of left-side and right-side skew-symmetric matrices that obey the multiplicative identities above,
\[
v' = Q_L(q)\, Q_R(q^{-1})\, v,
\tag{2.43}
\]
where
\[
Q_L(q) =
\begin{bmatrix}
q_0 & -q_x & -q_y & -q_z \\
q_x & q_0 & -q_z & q_y \\
q_y & q_z & q_0 & -q_x \\
q_z & -q_y & q_x & q_0
\end{bmatrix},
\qquad
Q_R(q) =
\begin{bmatrix}
q_0 & -q_x & -q_y & -q_z \\
q_x & q_0 & q_z & -q_y \\
q_y & -q_z & q_0 & q_x \\
q_z & q_y & -q_x & q_0
\end{bmatrix}.
\tag{2.44}
\]
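The quaternion extraction of equations (2.38) and (2.39) can be written compactly as follows. This is an illustrative sketch assuming a proper rotation matrix as input; the final renormalization is added here only as numerical hygiene and is not part of the equations above.

```python
import numpy as np

def quat_from_R(R):
    """Quaternion [q0, qx, qy, qz] from a rotation matrix, equations (2.38)-(2.39).

    The largest squared component is taken from the diagonal terms for numerical
    stability; the remaining components follow from the off-diagonal terms.
    """
    q_sq = np.array([1 + R[0, 0] + R[1, 1] + R[2, 2],
                     1 + R[0, 0] - R[1, 1] - R[2, 2],
                     1 - R[0, 0] + R[1, 1] - R[2, 2],
                     1 - R[0, 0] - R[1, 1] + R[2, 2]]) / 4.0
    i = int(np.argmax(q_sq))
    q = np.empty(4)
    q[i] = np.sqrt(q_sq[i])
    d = 4.0 * q[i]
    if i == 0:
        q[1] = (R[2, 1] - R[1, 2]) / d
        q[2] = (R[0, 2] - R[2, 0]) / d
        q[3] = (R[1, 0] - R[0, 1]) / d
    elif i == 1:
        q[0] = (R[2, 1] - R[1, 2]) / d
        q[2] = (R[0, 1] + R[1, 0]) / d
        q[3] = (R[0, 2] + R[2, 0]) / d
    elif i == 2:
        q[0] = (R[0, 2] - R[2, 0]) / d
        q[1] = (R[0, 1] + R[1, 0]) / d
        q[3] = (R[1, 2] + R[2, 1]) / d
    else:
        q[0] = (R[1, 0] - R[0, 1]) / d
        q[1] = (R[0, 2] + R[2, 0]) / d
        q[2] = (R[1, 2] + R[2, 1]) / d
    return q / np.linalg.norm(q)
```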
2.3 Kinematics

Kinematics is a branch of mechanics which studies the motion of objects without regard to the underlying forces involved. Dynamics accounts for some or all of the forces which impact the motion. In our case, motion refers to changes in position and orientation in time. To reiterate, we are only concerned with the motion and not the underlying forces.

2.3.1 Position

The translational motion of a rigid body may be described as the time derivative of the origin's location within a reference frame. Simply imagine a particle moving freely in space. Typically, the motion may be sufficiently represented in discrete time by the first and second derivatives, velocity and acceleration respectively:
\[
p_k = p_{k-1} + \Delta t\,\dot{p}_{k-1} + \frac{\Delta t^2}{2}\,\ddot{p}_{k-1}
    = p_{k-1} + \Delta t\,v_{k-1} + \frac{\Delta t^2}{2}\,a_{k-1}.
\tag{2.45}
\]

2.3.2 Orientation

The rotational motion of a rigid body may be described by the time derivative of its orientation. Imagine an orthogonal triad spinning about an axis pointing out of its origin. Angular velocity is the instantaneous rotation about this axis. The length of the axis is proportional to the magnitude of rotation.

2.3.2.1 Angular Velocity

A simple derivation is needed to clarify the statement made above and develop a mathematical foundation for angular velocity. Over a small time interval, the rotation caused by angular velocity may be represented by a rotation matrix. The new orientation is the product of this matrix and the old orientation. Depending on the reference frame of the angular velocity, the order of multiplication may be reversed. In the following derivation, the angular velocity of frame {B} expressed in its own coordinate system is considered. A small rotation occurs so that
\[
{}^A_B R_{k+1} = {}^A_B R_k\, R(\Delta\theta, \hat{k}).
\tag{2.46}
\]
The time derivative of the rotation matrix is found in the limit as the change in time approaches zero:
\[
{}^A_B \dot{R}_k
= \lim_{\Delta t \to 0} \frac{{}^A_B R_{k+1} - {}^A_B R_k}{\Delta t}
= \lim_{\Delta t \to 0} \frac{{}^A_B R_k\, R(\Delta\theta, \hat{k}) - {}^A_B R_k}{\Delta t}
= {}^A_B R_k \left( \lim_{\Delta t \to 0} \frac{R(\Delta\theta, \hat{k}) - I_3}{\Delta t} \right).
\tag{2.47}
\]
Using small-angle substitutions in equation 2.20, the following result is obtained:
\[
R(\Delta\theta, \hat{k}) \approx
\begin{bmatrix}
1 & -k_z \Delta\theta & k_y \Delta\theta \\
k_z \Delta\theta & 1 & -k_x \Delta\theta \\
-k_y \Delta\theta & k_x \Delta\theta & 1
\end{bmatrix},
\tag{2.48}
\]
so that
\[
{}^A_B \dot{R}_k = {}^A_B R_k
\begin{bmatrix}
0 & -k_z \dot{\theta} & k_y \dot{\theta} \\
k_z \dot{\theta} & 0 & -k_x \dot{\theta} \\
-k_y \dot{\theta} & k_x \dot{\theta} & 0
\end{bmatrix}.
\tag{2.49}
\]
The three elements of $\dot{\theta}\hat{k}$ are the angular velocity vector contained within the skew-symmetric matrix,
\[
S(\omega) =
\begin{bmatrix}
0 & -\omega_z & \omega_y \\
\omega_z & 0 & -\omega_x \\
-\omega_y & \omega_x & 0
\end{bmatrix},
\tag{2.50}
\]
which can be calculated as
\[
S({}^B\omega_B^k) = {}^A_B R_k^T\, {}^A_B \dot{R}_k.
\tag{2.51}
\]
As stated earlier, the order of multiplication changes depending on the reference frame of the angular velocity. The derivation is almost the same for the angular velocity of frame {B} expressed in frame {A}. A summary of the results is provided below.
\[
{}^A_B R_{k+1} = R(\Delta\theta, \hat{k})\, {}^A_B R_k
\tag{2.52}
\]
\[
{}^A_B \dot{R}_k = S({}^A\omega_B^k)\, {}^A_B R_k
\tag{2.53}
\]
\[
S({}^A\omega_B^k) = {}^A_B \dot{R}_k\, {}^A_B R_k^T
\tag{2.54}
\]
Using the previous results, the discrete-time rotation update can be approximately expressed as
\[
{}^A_B R_{k+1} = {}^A_B R_k \left( I_3 + \Delta t\, S({}^B\omega_B^k) \right),
\tag{2.55}
\]
where $\Delta t$ is the time between samples $k$ and $k+1$.
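For reference, a minimal sketch of the first-order discrete update of equation (2.55) is shown below. Because it is only a first-order approximation, the integrated matrix slowly drifts away from orthogonality; the example values are arbitrary and the code is not taken from the thesis software.

```python
import numpy as np

def skew(w):
    """S(w) of equation (2.50)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def update_R_body(R, w_body, dt):
    """First-order discrete update of equation (2.55), body-frame angular velocity."""
    return R @ (np.eye(3) + dt * skew(w_body))

# Example: integrate a constant 90 deg/s yaw rate for one second in small steps.
R = np.eye(3)
w = np.array([0.0, 0.0, np.deg2rad(90.0)])
for _ in range(1000):
    R = update_R_body(R, w, 0.001)
print(np.round(R, 3))   # close to a 90-degree rotation about Z (with slight drift)
```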
2.3.2.2 Euler-Angles

Even though the rotation matrix may be parameterized in a number of ways, a relationship (usually) exists between angular velocity and the time derivative of the parameterization. If singularities exist in the parameterization, then the relationship is not always defined. To illustrate this point, consider the relationship between angular velocity and ZYX Euler angles. Angular velocity in terms of Euler angle time derivatives may be expressed by a (convention-specific) Jacobian,
\[
{}^A\omega_B = J(\theta_{xyz})\, \dot{\theta}_{xyz}
= R_z(\alpha) R_y(\beta) \begin{bmatrix} \dot{\gamma} \\ 0 \\ 0 \end{bmatrix}
+ R_z(\alpha) \begin{bmatrix} 0 \\ \dot{\beta} \\ 0 \end{bmatrix}
+ \begin{bmatrix} 0 \\ 0 \\ \dot{\alpha} \end{bmatrix}
=
\begin{bmatrix}
\cos(\beta)\cos(\alpha) & -\sin(\alpha) & 0 \\
\cos(\beta)\sin(\alpha) & \cos(\alpha) & 0 \\
-\sin(\beta) & 0 & 1
\end{bmatrix}
\begin{bmatrix} \dot{\gamma} \\ \dot{\beta} \\ \dot{\alpha} \end{bmatrix},
\tag{2.56}
\]
where $R(\theta_{xyz}) = {}^A_B R$. The inverse relationship, Euler angle derivatives as a function of angular velocity, may be found by symbolically inverting the Jacobian above,
\[
\dot{\theta}_{xyz} = J(\theta_{xyz})^{-1}\, {}^A\omega_B = J(\theta_{xyz})^{-1} R(\theta_{xyz})\, {}^B\omega_B,
\tag{2.57}
\]
where
\[
J(\theta_{xyz})^{-1} =
\begin{bmatrix}
\sec(\beta)\cos(\alpha) & \sec(\beta)\sin(\alpha) & 0 \\
-\sin(\alpha) & \cos(\alpha) & 0 \\
\tan(\beta)\cos(\alpha) & \tan(\beta)\sin(\alpha) & 1
\end{bmatrix}.
\tag{2.58}
\]
Obviously, the secant and tangent terms are unbounded near $|\beta| = \frac{\pi}{2}$, leading to numerical instability. A situation like this can have catastrophic effects on a system and should be avoided if possible.

2.3.2.3 Quaternion

As stated earlier, quaternions are a singularity-free parameterization of rotation. Therefore, the relationship between angular velocity and the quaternion time derivative always exists, and numerical instabilities due to inversion should never be encountered in practice. For this reason, quaternions are a favorite parameterization of rotation and are used in our system.

The derivation of the quaternion time derivative is similar to that of rotation matrices. Over a small time interval, the rotation caused by angular velocity may be represented by a quaternion. The new orientation is the product of this quaternion and the old quaternion. Depending on the reference frame of the angular velocity, the order of multiplication may be reversed. In the following derivation, ${}^B\omega_B$ is considered. A small rotation occurs so that
\[
q_{k+1} = q_k \otimes q(\Delta\theta, \hat{k}),
\tag{2.59}
\]
where $R(q) = {}^A_B R$. The time derivative of the quaternion is found in the limit as the change in time approaches zero:
\[
\dot{q}_k
= \lim_{\Delta t \to 0} \frac{q_{k+1} - q_k}{\Delta t}
= \lim_{\Delta t \to 0} \frac{q_k \otimes q(\Delta\theta, \hat{k}) - q_k}{\Delta t}
= q_k \otimes \left( \lim_{\Delta t \to 0} \frac{q(\Delta\theta, \hat{k}) - q_I}{\Delta t} \right).
\tag{2.60}
\]
From the rotation matrix derivation we know that $\omega = \dot{\theta}\hat{k}$, and using the small-angle substitution in equation 2.34 the following result is obtained:
\[
q(\Delta\theta, \hat{k}) \approx \begin{bmatrix} 1 \\ \frac{\Delta\theta}{2}\hat{k} \end{bmatrix},
\tag{2.61}
\]
\[
\dot{q}_k = \frac{1}{2}\, q_k \otimes \begin{bmatrix} 0 \\ \dot{\theta}\hat{k} \end{bmatrix}
          = \frac{1}{2}\, q_k \otimes \begin{bmatrix} 0 \\ {}^B\omega_B^k \end{bmatrix}.
\tag{2.62}
\]
Let $\Omega$ be the vector quaternion representing angular velocity, i.e. $\Omega_k = \begin{bmatrix} 0 & \omega_k \end{bmatrix}^T$. Then
\[
\dot{q}_k = \frac{1}{2}\, q_k \otimes {}^B\Omega_B^k,
\tag{2.63}
\]
\[
{}^B\Omega_B^k = 2\, q_k^{-1} \otimes \dot{q}_k.
\tag{2.64}
\]
As stated earlier, the order of multiplication changes depending on the reference frame of the angular velocity. The derivation is almost the same for ${}^A\omega_B$. A summary of the results is provided below.
\[
q_{k+1} = q(\Delta\theta, \hat{k}) \otimes q_k
\tag{2.65}
\]
\[
\dot{q}_k = \frac{1}{2}\, {}^A\Omega_B^k \otimes q_k
\tag{2.66}
\]
\[
{}^A\Omega_B^k = 2\, \dot{q}_k \otimes q_k^{-1}
\tag{2.67}
\]
Recall that quaternion multiplication can be replaced by ordinary matrix multiplication with a 4-dimensional skew-symmetric matrix. The previous result may be written as
\[
\dot{q}_k = \frac{1}{2}\, Q_R({}^B\Omega_B^k)\, q_k,
\tag{2.68}
\]
where
\[
Q_R(\Omega) =
\begin{bmatrix}
0 & -\omega_x & -\omega_y & -\omega_z \\
\omega_x & 0 & \omega_z & -\omega_y \\
\omega_y & -\omega_z & 0 & \omega_x \\
\omega_z & \omega_y & -\omega_x & 0
\end{bmatrix}.
\tag{2.69}
\]
Using the skew-symmetric matrix, the approximate discrete-time quaternion update can be written as
\[
q_{k+1} = \left( I_4 + \frac{\Delta t}{2}\, Q_R({}^B\Omega_B^k) \right) q_k
        = q_k + \frac{\Delta t}{2}\, Q_R({}^B\Omega_B^k)\, q_k.
\tag{2.70}
\]
However, from the infinite Taylor series a pattern emerges [36]. Assuming constant $\omega$, the following expression is a better approximation for larger $\Delta t$ and/or $\omega$:
\[
q_{k+1} = \left[ \cos\!\left(\frac{\|{}^B\Omega_B^k\|\Delta t}{2}\right) I_4
+ \frac{1}{\|{}^B\Omega_B^k\|} \sin\!\left(\frac{\|{}^B\Omega_B^k\|\Delta t}{2}\right) Q_R({}^B\Omega_B^k) \right] q_k.
\tag{2.71}
\]
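The two quaternion updates, equations (2.70) and (2.71), can be compared with the short sketch below. The renormalization after the first-order step is added here purely for numerical hygiene, and the closed-form update assumes the angular velocity is constant over the sample interval, as stated above. Function names and example values are my own.

```python
import numpy as np

def QR_omega(w):
    """Q_R(Omega) of equation (2.69) for a pure angular-velocity quaternion."""
    wx, wy, wz = w
    return np.array([[0.0, -wx, -wy, -wz],
                     [wx, 0.0, wz, -wy],
                     [wy, -wz, 0.0, wx],
                     [wz, wy, -wx, 0.0]])

def quat_update_first_order(q, w_body, dt):
    """Equation (2.70): first-order quaternion update, then renormalize."""
    q_new = q + 0.5 * dt * QR_omega(w_body) @ q
    return q_new / np.linalg.norm(q_new)

def quat_update_closed_form(q, w_body, dt):
    """Equation (2.71): exact update assuming constant angular velocity over dt."""
    n = np.linalg.norm(w_body)
    if n < 1e-12:
        return q.copy()
    half = 0.5 * n * dt
    A = np.cos(half) * np.eye(4) + (np.sin(half) / n) * QR_omega(w_body)
    return A @ q

q = np.array([1.0, 0.0, 0.0, 0.0])            # identity quaternion
w = np.array([0.0, 0.0, np.deg2rad(90.0)])    # 90 deg/s about the body z-axis
print(quat_update_closed_form(q, w, 1.0))     # roughly [0.707, 0, 0, 0.707]
```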
Chapter 3
Sensors & Data Processing

The pose estimation system relies on measurements from only two sensors, the camera and the IMU. This chapter discusses the processing of raw data from both sensors. The final output of the data processing will be referred to as a sensor "measurement". The chapter is divided into two sections, camera and IMU. The information presented here is necessary to understand material found in subsequent chapters. For example, the discussion of image formation using the pinhole camera model is the basis for the entire next chapter. Fusion and filtering of the two sensors is covered in chapter 5.

The data from the two sensors are very different. The IMU data consists of six fields, three angular velocities and three accelerations. An image consists of 1024x768x3 (about 2.36 million) data fields. Consequently, processing a camera image is much more expensive and time consuming than processing the IMU data.

The previous chapter included a discussion of 3D points expressed in multiple reference frames. We are primarily concerned with only three reference frames: the model {M}, camera {C}, and IMU {I}.

3.1 Camera

A rigid body may be parameterized by a minimum of four noncoplanar 3D points, often referred to as fiducials. These points are assumed to have known locations in the model frame but not necessarily in others. They are constantly used as a basis of reference and therefore must be carefully defined. These points will simply be called "model points". Model points expressed in the model reference frame are denoted by
\[
{}^M P_i = \begin{bmatrix} {}^M X_i & {}^M Y_i & {}^M Z_i \end{bmatrix}^T,
\tag{3.1}
\]
and it will sometimes be useful to write all model points together in a single matrix as
\[
{}^M P_* = \begin{bmatrix} {}^M P_1 & \dots & {}^M P_n \end{bmatrix}.
\tag{3.2}
\]
The projection model of the camera, covered next, requires points to be expressed in the camera frame. Model points expressed in the camera reference frame are denoted as
\[
{}^C P_i = \begin{bmatrix} {}^C X_i & {}^C Y_i & {}^C Z_i \end{bmatrix}^T, \qquad
{}^C P_* = \begin{bmatrix} {}^C P_1 & \dots & {}^C P_n \end{bmatrix}.
\tag{3.3}
\]
A set of points expressed in two different reference frames are related by a homogeneous transformation,
\[
\begin{bmatrix} {}^C P_* \\ 1 \end{bmatrix} = {}^C_M H \begin{bmatrix} {}^M P_* \\ 1 \end{bmatrix}.
\tag{3.4}
\]
Details of the homogeneous transformation were covered in the previous chapter. If the pose of the model with respect to the camera, ${}^C_M H$, is known, then the model points may be transformed into the camera frame and projected onto the image plane. In AR, this pose is very important because it is needed to correctly register virtual objects.

3.1.1 Pinhole Camera Model

In the pinhole camera, rays of light travel through an infinitesimally small aperture, and the intersection of the light rays with the image plane forms an image of the object, as shown in figure 3.1. The mapping from three dimensions to two dimensions is called perspective projection. The focal length is the distance between the image plane and the center of projection (COP). As the focal length decreases, the viewing angle increases. The perpendicular line from the COP through the image plane is the optical axis, usually the z-axis. The point where the optical axis passes through the image plane is the principal point. To avoid a flipped image, we assume the image plane is in front of the COP.

Figure 3.1: Perspective Projection

The expression for normalized projection is
\[
\begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \end{bmatrix}
= \frac{1}{{}^C Z_i} \begin{bmatrix} {}^C X_i \\ {}^C Y_i \end{bmatrix}.
\tag{3.5}
\]
Note that this does not include any of the intrinsic camera parameters. Such points are referred to as "normalized" image points. Image coordinates are, of course, found using the camera's intrinsic parameters. The parameters result from both the camera hardware and the lens but are simply referred to as camera parameters. If the lens is not fixed, such as a zoom lens, then some of these parameters will change with different lens settings.
In most modern cameras the skew ($s$) is usually close to zero. If the camera has square pixels, the aspect ratio ($\tau$) will equal 1, or $f_v = f_u$. Projection to image coordinates may be compactly written in two equations, referred to as the perspective projection equations,
\[
u_i = f_u \tilde{u}_i + u_0 = f_u \frac{{}^C X_i}{{}^C Z_i} + u_0
    = f_u \frac{r_{1*}\,{}^M P_i + t_x}{r_{3*}\,{}^M P_i + t_z} + u_0,
\tag{3.6}
\]
\[
v_i = f_v \tilde{v}_i + v_0 = f_v \frac{{}^C Y_i}{{}^C Z_i} + v_0
    = f_v \frac{r_{2*}\,{}^M P_i + t_y}{r_{3*}\,{}^M P_i + t_z} + v_0.
\tag{3.7}
\]
By collecting the intrinsic parameters into a matrix,
\[
K =
\begin{bmatrix}
\tau f & s & u_0 \\
0 & f & v_0 \\
0 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
f_u & s & u_0 \\
0 & f_v & v_0 \\
0 & 0 & 1
\end{bmatrix},
\tag{3.8}
\]
image coordinates can also be computed as
\[
\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix}
= K \begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \\ 1 \end{bmatrix}.
\tag{3.9}
\]
The Matlab Camera Calibration Toolbox is used for estimating the camera's intrinsic parameters, $f_u$, $f_v$, $u_0$, $v_0$, $k_1$, $k_2$, $p_1$, $p_2$. Images of a calibration pattern (a checkerboard) are taken from many different poses and then processed by Matlab. For each image, a grid is constructed by segmenting the corners within the pattern. The intrinsic and extrinsic parameters are determined simultaneously in a large nonlinear optimization. The extrinsic parameters, the pose of the camera with respect to the grid, are important later in the IMU calibration section at the end of the chapter.

3.1.2 Image Point Correction

The pinhole model is only an approximation of real camera projection. While it is a useful model that enables simple mathematical formulation of the relationship between object and image coordinates, when high accuracy is required a more comprehensive camera model must be used. A normalized image point with second-order radial and tangential distortion terms added may be expressed as
\[
\begin{bmatrix} \bar{u}_i \\ \bar{v}_i \end{bmatrix}
\simeq
\begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \end{bmatrix}
+ (k_1 r_i^2 + k_2 r_i^4)
\begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \end{bmatrix}
+
\begin{bmatrix}
2 p_1 \tilde{u}_i \tilde{v}_i + p_2 (r_i^2 + 2\tilde{u}_i^2) \\
p_1 (r_i^2 + 2\tilde{v}_i^2) + 2 p_2 \tilde{u}_i \tilde{v}_i
\end{bmatrix},
\qquad
r_i^2 = \tilde{u}_i^2 + \tilde{v}_i^2.
\tag{3.10}
\]
A raw image from the camera contains distortion, or warping. Typically, the effect of lens distortion is more noticeable with a wide-angle lens. Unfortunately, the inverse mapping from distorted points to undistorted points,
\[
\begin{bmatrix} \bar{u}_i \\ \bar{v}_i \end{bmatrix}
\rightarrow
\begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \end{bmatrix},
\tag{3.11}
\]
is much more complicated and is usually done iteratively. We use the distortion model described within [15], [16], [38]. An entire image may be undistorted, or, to save time, only a subset of feature points may be undistorted.
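A compact sketch of the forward camera model, equations (3.5) through (3.10), is given below. The intrinsic values and distortion coefficients in the example are placeholders rather than the calibrated parameters of the actual camera, and skew is assumed to be zero as discussed above.

```python
import numpy as np

def project_points(P_cam, K, dist):
    """Project 3xn camera-frame points with distortion, equations (3.5)-(3.10).

    K holds fu, fv, u0, v0 (skew assumed zero); dist = (k1, k2, p1, p2).
    """
    k1, k2, p1, p2 = dist
    u_n = P_cam[0] / P_cam[2]                      # normalized projection (3.5)
    v_n = P_cam[1] / P_cam[2]
    r2 = u_n**2 + v_n**2
    radial = k1 * r2 + k2 * r2**2                  # radial terms of (3.10)
    u_d = u_n + u_n * radial + 2*p1*u_n*v_n + p2*(r2 + 2*u_n**2)
    v_d = v_n + v_n * radial + p1*(r2 + 2*v_n**2) + 2*p2*u_n*v_n
    fu, fv, u0, v0 = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.vstack([fu * u_d + u0, fv * v_d + v0])   # pixel coordinates (3.6)-(3.7)

K = np.array([[800.0, 0.0, 512.0],
              [0.0, 800.0, 384.0],
              [0.0, 0.0, 1.0]])                    # example intrinsics only
P_cam = np.array([[0.1, -0.2], [0.05, 0.1], [1.0, 2.0]])
print(project_points(P_cam, K, (0.1, 0.01, 0.0, 0.0)))
```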
3.1.3 Traffic Cones as Fiducials

The goal of many computer vision applications is automatic detection of features, often referred to as segmentation. Segmentation always relies on image processing techniques in some of the initial steps. Robust segmentation of complex features in unstructured environments is a very difficult task. As a result, people try to use features that are very easy to segment; corners, dots, and lines are a few examples. In augmented reality, many systems, such as ARToolKit, have used planar patterns that are like a calibration grid, except that these patterns must be more unique because multiple patterns might be used in the same scene and viewed from many different orientations. However, the distance and viewing angle of the camera are highly restricted.

For those reasons, in our system we have decided to use 3-dimensional fiducials: orange traffic cones. The cones are easy to segment because of their unique shape and color. They also allow the system to be scalable because any size cone may be used. The geometry of the cones is very simple and is shown in figure 3.2. Each cone has three points of interest along the major axis: a peak, middle, and base.

Figure 3.2: Cone Geometry

Thus, a set of model points will be defined as
\[
{}^M P_* =
\begin{bmatrix}
{}^M P_{peak}^1 & {}^M P_{mid}^1 & {}^M P_{base}^1 & \dots &
{}^M P_{peak}^n & {}^M P_{mid}^n & {}^M P_{base}^n
\end{bmatrix}.
\tag{3.12}
\]
We use three traffic cones in an equilateral triangle configuration where the three peak, middle, and base points define three parallel planes. Therefore, the major axes of the cones are parallel and perpendicular to these planes.

3.1.4 Model Point Correction

In reality, the major axis of a cone is not visible. Depending on location, the camera views a specific side of the cone's outer surface. The location of the camera with respect to the cone defines a line of sight. Imagine three vectors originating at a single point which end at a cone's peak, middle, and base points along the outer surface. The camera does not see the major axis, which is why the model point correction is needed. The correct way of thinking about the problem is that the major axis (and its peak, middle, and base points) is projected onto the surface. Given an angle about the major axis, the same three points on the cone's surface may be computed.

Figure 3.3: Model Point Correction

The short routine below and figure 3.3 explain how to correct the model points from the major axis to the surface given a predicted location of the camera. Assume that we are only given the location of the COP, the peak and base points for each cone, and the radius-to-height ratio, in our case $\tau = 1/7$. The radius at the base is found by $r = \tau h$. The height may be calculated from the peak and base points via
\[
h = \left\| {}^M P_{base}^i - {}^M P_{peak}^i \right\|.
\tag{3.13}
\]
As stated earlier, the cones all sit on the same plane. The nearest point of the COP on the base plane, $COP'$, may be computed in a few easy steps, and any cone can be used for the calculation. The peak and base points also define a direction vector that is perpendicular to the base plane,
\[
\vec{h} = \frac{{}^M P_{base}^i - {}^M P_{peak}^i}{\left\| {}^M P_{base}^i - {}^M P_{peak}^i \right\|}.
\tag{3.14}
\]
A vector from the COP to the base point defines a line of sight,
\[
l_i = {}^M P_{base}^i - COP.
\tag{3.15}
\]
$COP'$ may be calculated by projecting $l_i$ onto $\vec{h}$, i.e.
\[
COP' = COP + (l_i \cdot \vec{h})\, \vec{h}.
\tag{3.16}
\]
Now, a base direction toward $COP'$ may be computed for each cone via
\[
\vec{r}_i = \frac{COP' - {}^M P_{base}^i}{\left\| COP' - {}^M P_{base}^i \right\|}.
\tag{3.17}
\]
Finally, position-adjusted model points may be calculated as
\[
{}^M \tilde{P}_{base}^i = {}^M P_{base}^i + r\,\vec{r}_i, \qquad
{}^M \tilde{P}_{peak}^i = {}^M P_{peak}^i, \qquad
{}^M \tilde{P}_{mid}^i = \frac{{}^M \tilde{P}_{peak}^i + {}^M \tilde{P}_{base}^i}{2}.
\tag{3.18}
\]
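The model point correction of equations (3.13) through (3.18) reduces to a few vector operations per cone. The sketch below is a direct transcription of those equations under the stated assumptions (all cones resting on a common base plane, radius-to-height ratio of 1/7); it is illustrative and not the thesis routine itself.

```python
import numpy as np

def correct_model_points(P_peak, P_base, cop, tau=1.0/7.0):
    """Shift a cone's axis points onto the visible surface, equations (3.13)-(3.18).

    P_peak, P_base: 3-vectors of one cone's peak and base on the major axis.
    cop: predicted camera center of projection expressed in the model frame.
    """
    h_vec = P_base - P_peak
    h = np.linalg.norm(h_vec)                    # cone height, equation (3.13)
    h_hat = h_vec / h                            # axis direction, equation (3.14)
    r = tau * h                                  # base radius
    l = P_base - cop                             # line of sight, equation (3.15)
    cop_proj = cop + np.dot(l, h_hat) * h_hat    # COP projected onto base plane (3.16)
    r_hat = cop_proj - P_base
    r_hat = r_hat / np.linalg.norm(r_hat)        # base direction toward COP' (3.17)
    base_c = P_base + r * r_hat                  # corrected base point, equation (3.18)
    peak_c = P_peak.copy()
    mid_c = 0.5 * (peak_c + base_c)
    return peak_c, mid_c, base_c
```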
3.1.5 Cone Segmentation

The goal of the cone segmentation is to automatically identify the 2D coordinates of the peak, middle, and base points of the cones within a single image. The segmentation procedure is the most expensive part of the entire pose estimation system, but it may still be performed at frame rate, 30 Hz, with a dual-core processor. The procedure may be broken down into five distinct steps (figure 3.4): color classification, connected components, blob analysis, template matching, and image point correction.

Figure 3.4: Cone Segmentation

3.1.5.1 Color Classification

The segmentation procedure relies heavily on the cones' unique orange color, which is not usually found in most environments. Lighting conditions will, however, affect the color that appears in an image. Robust color classification can tolerate normal light fluctuations. The input to the color classification block is an RGB image and the output is a binary image of cone pixels (1) and non-cone pixels (0). In other words, the output image should contain only silhouettes of the cones, figure 3.5.

Figure 3.5: Cone Blob

Each pixel in an 8-bit RGB image contains red, green, and blue components that range from 0 to 255. Therefore, a total of 16.8 million RGB combinations exist. Imagine a cube with one of its corners lying at the origin and the three associated sides aligned with the axes, figure 3.6. Traveling along the axes will yield red, green, or blue in increasing brightness. Moving off of the axes yields a mix of the colors. Black exists at {0,0,0} and white at {255,255,255}. Traveling along the line from black to white results in gray scale because no single color dominates. This is consistent with an additive color model.

Figure 3.6: RGB Color Space

The hue, saturation, and intensity (HSI) space describes color in a more intuitive fashion. For example, most color palettes use HSI space. Hue describes the blend of color and is usually defined from 0 to $2\pi$. Saturation defines the color purity, or shade of gray between black and white, and is usually defined between 0 and 100%. Intensity defines the brightness of the color and is usually defined between 0 and 100%. The HSI space looks like a cone instead of a cube, as seen in figure 3.7. A nonlinear mapping exists between the two color spaces. For example, an RGB image may be converted to HSI by
\[
h =
\begin{cases}
0, & \max = \min \\[4pt]
\dfrac{\pi}{3}\, \dfrac{g - b}{\max - \min}, & \max = r \\[4pt]
\dfrac{\pi}{3}\, \dfrac{b - r}{\max - \min} + \dfrac{2\pi}{3}, & \max = g \\[4pt]
\dfrac{\pi}{3}\, \dfrac{r - g}{\max - \min} + \dfrac{4\pi}{3}, & \max = b
\end{cases}
\tag{3.19}
\]
\[
s =
\begin{cases}
0, & \max = 0 \\[4pt]
\dfrac{\max - \min}{\max}, & \text{otherwise}
\end{cases}
\tag{3.20}
\]
\[
v = \max.
\tag{3.21}
\]

Figure 3.7: HSI Color Space

To define the color of an object, a large sample of pixels may be collected in various lighting conditions. Several different methods of analysis may be used, including a multivariate Gaussian approximation, nonlinear PCA, and a histogram. Ideally, a small region will encapsulate most of the pixels. However, all materials have different light dispersion properties and may be more or less difficult to classify. If a color region is to be defined for classification, HSI space works much better than RGB. An alternative to the sampling approach is to allow the region to be manipulated manually in real time. The user simply adjusts the six parameters within
\[
\mathrm{Pixel}_{ij} =
\begin{cases}
1, & (H_L < H_{ij} < H_H) \cap (S_L < S_{ij} < S_H) \cap (I_L < I_{ij} < I_H) \\
0, & \text{otherwise}
\end{cases}
\tag{3.22}
\]
and sees the result immediately. Under typical lighting conditions we have found that this calibration step usually does not require any attention because the default values always work very well.

To speed up the classification process in the real-time system, a large RGB classification table is built. Each of the 16.8 million RGB combinations is assigned a one or a zero, marking cone and non-cone colors. In this way, pixels within an RGB image may be classified directly instead of constantly converting to HSI and then classifying. Recall that the mapping between color spaces is nonlinear.
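The lookup-table idea can be sketched as follows. To keep the example small it samples the RGB cube every 8 levels instead of storing all 16.8 million entries, and the hue, saturation, and intensity thresholds shown are placeholders, not the system's calibrated defaults. The conversion follows equations (3.19) through (3.21), with the hue wrapped into [0, 2π).

```python
import numpy as np

def rgb_to_hsi_arrays(r, g, b):
    """Vectorized RGB (0-1) to hue/saturation/intensity per equations (3.19)-(3.21)."""
    mx = np.maximum(np.maximum(r, g), b)
    mn = np.minimum(np.minimum(r, g), b)
    d = np.where(mx > mn, mx - mn, 1.0)          # avoid divide-by-zero where max == min
    h = np.zeros_like(r)
    h = np.where(mx == r, (np.pi / 3) * (g - b) / d, h)
    h = np.where(mx == g, (np.pi / 3) * (b - r) / d + 2 * np.pi / 3, h)
    h = np.where(mx == b, (np.pi / 3) * (r - g) / d + 4 * np.pi / 3, h)
    h = np.where(mx == mn, 0.0, h) % (2 * np.pi)  # wrap negative hues into [0, 2*pi)
    s = np.where(mx > 0, (mx - mn) / np.where(mx > 0, mx, 1.0), 0.0)
    return h, s, mx

def build_cone_lut(h_lim, s_lim, i_lim, step=8):
    """Binary RGB lookup table realizing equation (3.22), sampled every `step` levels."""
    vals = np.arange(0, 256, step) / 255.0
    r, g, b = np.meshgrid(vals, vals, vals, indexing="ij")
    h, s, i = rgb_to_hsi_arrays(r, g, b)
    return ((h_lim[0] < h) & (h < h_lim[1]) &
            (s_lim[0] < s) & (s < s_lim[1]) &
            (i_lim[0] < i) & (i < i_lim[1]))     # index with [r // step, g // step, b // step]

lut = build_cone_lut(h_lim=(0.0, 0.35), s_lim=(0.5, 1.0), i_lim=(0.4, 1.0))
print(lut.shape, int(lut.sum()), "sampled colors classified as cone-orange")
```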
The convex hull is the smallest convex polygon that contains the region. Computed as region area divided by convex area. • Centroid - the center of mass of the region. • Major Axis Length - the length of the major axis of the ellipse that has the same normalized second central moments as the region. • Minor Axis Length - the length of the minor axis of the ellipse that has the same normalized second central moments as the region. • Orientation - the angle between the x-axis and the major axis of the ellipse. • Major-Minor Axis Ratio - the proportion of major axis to minor axis length. Computed as major axis length divided by minor axis length. • Bounding Box - the smallest rectangle that contains the region. Forming connected components and region properties can be very expensive if the image size is large and many artifacts exist. In this context, artifacts are small regions that are too small to possibly be a cone. To save time, it is important to reduce the search space by identifying a region of interest (ROI) that contains the fewest amount of artifacts as possible. The ROI is quickly found in a two step procedure. First, the image is down-sampled by pyramid decomposition, figure 3.9. Each level in the pyramid has a quarter of the number of pixels as the level below it. Down sampling occurs by simply rejecting even rows and columns. We only down-sample twice. Secondly, the (binary) down-sampled image is searched for the smallest rectangle that contains all of 32 Figure 3.9: Pyramid Decomposition the remaining cone pixels (1’s). The vertices of the rectangle in the down-sampled image correspond to locations within the full-size binary image. Typically, the ROI is about half the area of the full image. 3.1.5.3 Blob Analysis Blobs are either accepted or rejected by the properties described above. • Upper and lower bounds are placed on the area. The range is large enough to allow significant variation in camera viewing distance of the cones. • A border region is used to reject objects with centroids located near the edges of the image. Part of the object might not be visible and therefore its properties cannot be accurately determined. • Upper and lower bounds are placed on the solidity. The cones are uniform color and convex. Under ideal lighting conditions they have a solidity value very close to one. Shadows and low light environments can cause cone pixels to be misclassified; this causes the solidity to drop. Typically, solidity helps reject non-cone objects that have similar color but are not uniform and/or convex. 33 Figure 3.10: Cone with Ellipse • Upper and lower bounds are placed on the major-minor axis ratio. The cones used for testing are about 4 inches tall with 1 inch diameter base. Therefore, we expect this ratio to be somewhere around 4. If the blob properties are acceptable than a final line search is performed. The orientation and centroid define two direction vectors along the major axis. The vectors point in opposite directions and one is the negative reciprocal of the other. The peak and the base are both located somewhere along these vectors, as shown in figure 3.10. The search starts at the origin and moves along the major axis until three consecutive non-cone pixels are found. This typically works very well because the centroid and orientation are well defined properties of the cone shape. 3.1.5.4 Template Matching The line search, above, provides a good estimate of the cone’s height and orientation in the image. 
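As a concrete illustration, that search can be sketched in a few lines of NumPy. This is a minimal sketch rather than the implementation used in this work; the function name, the (row, col) conventions, and the parameters are illustrative.

    import numpy as np

    def axis_endpoints(mask, centroid, orientation, max_steps=500, miss_limit=3):
        """Walk from the blob centroid along +/- the major axis of a binary
        mask until `miss_limit` consecutive non-cone pixels are seen.
        Returns the two extreme cone pixels (candidate peak and base)."""
        # Unit step along the major axis; `orientation` is the ellipse angle
        # measured from the image x-axis, so the row component uses -sin.
        step = np.array([-np.sin(orientation), np.cos(orientation)])
        endpoints = []
        for sign in (+1.0, -1.0):
            pos = np.asarray(centroid, dtype=float)
            last_hit = pos.copy()
            misses = 0
            for _ in range(max_steps):
                pos = pos + sign * step
                r, c = int(round(pos[0])), int(round(pos[1]))
                if not (0 <= r < mask.shape[0] and 0 <= c < mask.shape[1]):
                    break
                if mask[r, c]:
                    last_hit = np.array([r, c], dtype=float)
                    misses = 0
                else:
                    misses += 1
                    if misses >= miss_limit:
                        break
            endpoints.append(last_hit)
        return endpoints[0], endpoints[1]

The two returned endpoints give the estimated peak-to-base extent that indexes into the template set described next.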
Template matching is performed to refine this estimate. The total number of correlations is restricted to nine because correlation is a costly operation; cost increases with search area and template size. The nine templates include one that corresponds to the initially estimated height and orientation followed by eight neighboring templates. A neighboring template is one unit of height and/or orientation away from the initial estimate, figure 3.11. The search region is a rectangle of equal or greater size than the bounding box of the blob.

Figure 3.11: Cone Templates

The templates are very simple and contain a single white triangle on a black background, a perfect silhouette of the cones. The centroid of the triangle is located exactly at the image center. At program startup all of the templates are preloaded into memory and indexed by height and orientation. The template set includes a total of (180/3) * ((600 - 100)/3) = 10,000 binary images, but many are seldom used. A score is calculated at each pixel by the cross-correlation

c(u, v) = Σ_{x,y} I(x, y) T(x - u, y - v) ,

where I is the image and the sum is taken over x, y within the window containing the template T positioned at u, v.

3.1.6 Correspondence

The cones in the image correspond to specific cones in the model. Specifically, the 3D model points of 3.12 are matched to a set of image points,

^c p = [ u^1_{peak}  u^1_{mid}  u^1_{base}  ...  u^n_{peak}  u^n_{mid}  u^n_{base} ;
         v^1_{peak}  v^1_{mid}  v^1_{base}  ...  v^n_{peak}  v^n_{mid}  v^n_{base} ] .    (3.23)

The correspondence problem is difficult because the cones are all identical in appearance. Also, there is a chance that non-cone objects are detected and that actual cones within the field of view are missed. A predicted pose can be used to estimate where the cones should appear in the image. Many algorithms exist to solve the correspondence problem given noisy observations and a prediction. We only have three cones, so a simple greedy algorithm is used: all possible assignments are enumerated and scored, and the assignment with the lowest score is kept if it falls below a maximum distance threshold.

3.1.7 Pose Estimation

Given the camera's intrinsic parameters, pose may be determined from a set of 2D image points and corresponding 3D model points. The discussion of the problem is quite lengthy, and for this reason, it is deferred to the next chapter.

3.2 Inertial Measurement Unit

The IMU is a small 3-axis inertial sensor that provides high-resolution angular velocity,

z_gyro = ^I ω_imu = [ ω_x  ω_y  ω_z ]^T ,    (3.24)

and acceleration,

z_acl = ^I a_imu + ^I a_grav = [ a_x  a_y  a_z ]^T ,    (3.25)

data through a serial interface (RS232). MEMS gyroscopic rate sensors and accelerometers are mounted on orthogonal axes, figure 3.12. Because our goal is to estimate the pose of the camera with respect to the model, the IMU measurements are transformed into the camera coordinate system. The IMU is rigidly mounted to the camera so that the pose of the IMU with respect to the camera does not change. Although the translation between the two sensors is actually around ten centimeters, we set it equal to zero to simplify the math to follow. This means that the origins of the two coordinate systems are treated as coincident at a single point, P_Corg ≡ P_Iorg.

Figure 3.12: IMU Coordinate System

Chapters 5 and 6 of Craig [8] discuss the relationship of angular velocity and acceleration in different reference frames.
For two coincident reference frames, the relationship may be simplified to C I ωcam ≡ C I R ωimu , 3.2.1 C acam + Cagrav ≡ C IR I aimu + Iagrav . (3.26) Static Bias Correction We assume that the gyros output a measurement that contains a bias and Gaussian zeromean noise, I 0 ωimu = Iωimu + bias + vgyro , (3.27) where vgyro ∼ N (0, σgyro ). Modeling the accelerometer noise is much more complicated and not covered here. The gyro bias seems to change by a small amount every time the unit is turned on and might even drift slightly during operation. A good correction scheme would constantly refine the biases during operation by using a complementary Kalman Filter, [29]. We call the correction “static” because of the way the bias is calculated. Many static angular velocity measurements are recorded at several different orientations. This could be done as a calibration step every time the system started. All axes of the gyros should read zero if the unit is motionless. When these static values are plotted, the histogram has the normal form with approximately the same non-zero mean at all orientations. Surprisingly, the bias on each axis can be very different. The static measurements also provide a measurement of uncertainty for the gyros. 37 After the bias is removed the variance for the measurement noise remains the same. An estimate of such uncertainty is important in later chapters. 3.2.2 Calibration of Relative Sensor Orientation The goal of this calibration procedure is to determine the relative orientation of the camera and IMU by using gravity vectors. For each static pose in the camera calibration, a large set of accelerometer measurements are recorded and then averaged. Under this motionless condition, the accelerometers provide a measure of gravity. In the calibration grid coordinate system gravity is defined to point in the downward direction. Therefore, the extrinsic parameters from camera calibration can be used to rotate gravity from the grid into the camera’s coordinate system via C i agrav G = C GRi agrav . (3.28) We are interested in the rotation matrix that aligns these two sets of vectors, C ∗ agrav = h C 1 agrav C n agrav ... i I ∗ agrav , = h I 1 agrav ... I n agrav i . (3.29) With a reasonable initial guess, a nonlinear minimization algorithm may be used by parameterizing the rotation matrix. C IR ∼ min E = α,β,γ C agrav IaTgrav I agrav IaTgrav −1 n X C I R(α, β, γ) Iaigrav − Caigrav 2 (3.30) (3.31) i=1 A superior approach uses singular value decomposition of a moment matrix. M= n X C i agrav T I i agrav (3.32) i=1 {U, S, V } = svd(M ) C IR = UV T (3.33) (3.34) No initial guess is required and the rotation matrix is not parameterized. More details of this relative orientation method are provided in the next chapter. 38 Chapter 4 Single Camera, Single Frame Pose The problem of determining pose from a single 2D-3D point set was introduced in the last chapter. Because the discussion of the problem is so lengthy, it has been given its own chapter. As a reminder, pose may be determined from a set of 2D image points with corresponding 3D model points given the camera’s intrinsic parameters. First, basic concepts and several classic algorithms are presented. Then, several experimental algorithms are presented as alternatives. 4.1 Types of Algorithms There is a long history of study in this area and many algorithms have been proposed. 
Difficulty arises due to image point noise, nonlinearity of perspective projection, and the constraint requirements of any 6-DOF pose parameterization. Linear algorithms can be very fast but are typically not as accurate, [12], [3], [31]. In general, they attempt to relax the pose constraints or move them outside of the image based minimization. Iterative algorithms constrain the pose directly and are usually very accurate, [24]. Unfortunately, they require a reasonably good initial guess for fast and proper convergence to the global minima. Qualities of the 2D-3D point set itself may be used to distinguish many algorithms from each other. In object space, the points may all lie on a plane, such as a calibration grid, or they may form a non-planar configuration which might be arbitrary or of specific shape. The number of points in the set is also very important. Some algorithms are designed for a specific number of points, such as 3, 4, and 5 while others work for an arbitrary number 39 Figure 4.1: Error Spaces of points. For an arbitrary number of points, the algorithm complexity is very important because the number of points might become quite large. Finally, the algorithm will operate in image or object space as seen in figure 4.1. 4.1.1 Image Space The majority of all algorithms operate in image space, meaning that the algorithm attempts to minimize a residual consisting of errors on the image plane, in two dimensions. Each image point has two error components and the total error, E, for the entire set is written as a sum of squares. 2 C 2 n C X Xi Yi E= fu C + u0 − ui + fv C + v0 − vi Zi Zi i=1 2 2 n X r1∗ MPi + tx r2∗ MPi + ty fu = + u0 − ui + fv + v0 − vi r3∗ MPi + tz r3∗ MPi + tz i=1 (4.1) The error equations are based on the perspective projection equations of chapter three. It is worth noting that the error is not a true orthogonal projection. 40 4.1.2 Object Space Fewer algorithms operate in object space for various reasons. One possible reason may be that people are simply unfamiliar with the concept because it is not directly obvious from the standard perspective projection equations. Image points are used to construct a projection matrix which performs an orthogonal projection of a 3D camera point onto the direction vector defined by the image point. The idea of applying the projection matrix to a computer vision algorithm was first introduced by Lu and Hager in the Orthogonal Iteration Algorithm [25]. The object space error function is defined as n X (I − Fi ) CPi 2 E= = i=1 n X (4.2) (I − Fi ) C M MR Pi 2 + tM , C i=1 where Fi is the projection matrix. It is constructed from “normalized” image points (chapter 3) so that wi w T (4.3) Fi = T i . wi wi Recall that such points are unit focal length with image offsets removed, i.e. u −u 0 i ũi fu 0 wi = ṽi = vif−v . v 1 1 (4.4) However, if the effective focal lengths are equal, fu = fv = f , construction of the projection matrix really only requires that the image offsets be removed. Further, normalized image points may already be available from the undistorting procedure of chapter 3. A quick derivation of the projection matrix, Fi , shows that it is a true orthogonal projection which uses the dot product as shown below. wi wiT C Pi kwi kkwi k wi wiT CPi ← Dot Product = kwi k kwi k = ŵi cos(θ) CPi Fi CPi = 41 (4.5) If wi and CPi are colinear then ŵi = CP̂i and θ = 0. Clearly we have an identity. If not colinear, an error exists, 2 2 ei = CPi − Fi CPi = CPi − ŵi cos(θ) CPi . 
(4.6) From the previous result it can be seen that 3D points lying further away of the camera are weighted more heavily than close points; in fact, the weight is proportional to distance. This may be considered an undesirable property that has deterred people from object space algorithms, but it is not much of a problem if all points have nearly the same depth. 4.2 Classic Algorithms Several classic algorithms are presented in this section for two reasons: (1) reinforcement of the different algorithm types discussed earlier, (2) establish a foundation for the experimental algorithms section to follow. 4.2.1 Perspective-3-Point - Fischler & Bolles A solution to the perspective-3-point problem was introduced by Fischler and Bolles in their classic RANSAC paper, [13]. It is one of the easiest algorithms to understand and visualize. In figure 4.2, the center of projection, L, and three object points, A, B, and C form a tetrahedron. The law of cosines yields three equations that completely describe the geometry of the tetrahedron, d2ab = a2 + b2 − 2ab cos(θab ) (4.7) d2ac = a2 + c2 − 2ac cos(θac ) (4.8) d2bc = b2 + c2 − 2bc cos(θbc ) . (4.9) Everything is known except the lengths of the tetrahedron legs, a, b, and c. The image points provide direction vectors to the model points in the camera frame, ŵa = wb wc wa , ŵb = , ŵc = . kwa k kwb k kwc k 42 (4.10) Figure 4.2: Perspective Tetrahedron The angles between the vectors may be easily computer by the dot product so that cos(θab ) = ŵa · ŵb , cos(θac ) = ŵa · ŵc , cos(θbc ) = ŵb · ŵc . (4.11) Additionally, the distances of the tetrahedron base are known from the model points. The three distances are computed as d2ab = k MPa − M Pb k2 , d2ac = k MPa − M Pc k2 , d2bc = k MPb − M Pc k2 . (4.12) Two of the leg lengths can be expressed in terms of the third leg. For example, b = x a, c = y a. Then, the original equations are rearranged so that d2ab = a2 + x2 a2 − 2a2 x cos(θab ) d2ac = a2 + y 2 a2 − 2a2 y cos(θac ) (4.13) d2bc = x2 a2 + y 2 a2 − 2a2 xy cos(θbc ) . The three equations can be reduced to a 4th-order polynomial in either x or y. For example, P4 x 4 + P3 x 3 + P 2 x 2 + P 1 x + P0 = 0 . 43 (4.14) Of the four possible roots, we are only interested in those that are positive and real. Length a is calculated for each valid root and then b and c are found next. There are at most eight possible solutions for the tetrahedron but only four of them can be in a valid position in front of the camera. The estimated location of the model points in the camera frame can be calculated from the leg lengths and direction vectors, C Pa = a ŵa , CPb = b ŵb , CPc = a ŵc . (4.15) Finally, the pose of the model with respect to the camera may be found by aligning the two congruent triangles in 3D, figure 4.3. An analytic solution can be found in [?]. Both triangles are first translated to the origin and then rotated to lie in the same plane with one side along a common axis. The alignment procedure requires three transformations (1 translation, 2 rotations) for each triangle. Results and illustration ( figure 4.3) provided below. " # " # C M Pi Pi = H6 H5 H4 H3 H2 H1 1 1 " # " # (4.16) M C P P i i W = W MH CH 1 1 C MH 4.2.2 = C W WH MH (4.17) Direct Linear Transform - Abdel-Aziz & Karara The perspective-three-point problem has some obvious shortcomings including sensitivity to noise and pose ambiguity. [42] provides a geometric explanation of configurations that lead to ambiguity. These are the two primary factors which motivate the use of additional points. 
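Before moving to algorithms that use more points, the geometric core of the P3P solution can be sketched numerically. Rather than forming the quartic of equation 4.14, the sketch below solves equations 4.7-4.9 for the leg lengths with a generic root finder, which returns only the solution nearest the initial guess (the quartic enumerates all of them); the function name and the use of scipy.optimize.fsolve are illustrative choices, not the procedure of [13].

    import numpy as np
    from scipy.optimize import fsolve

    def p3p_leg_lengths(w_a, w_b, w_c, P_a, P_b, P_c, guess=(1.0, 1.0, 1.0)):
        """Solve equations 4.7-4.9 for the tetrahedron leg lengths a, b, c.
        w_* are the unit viewing directions of equation 4.10 and P_* the
        corresponding 3D model points."""
        # Angles between viewing rays (4.11) and squared base distances (4.12).
        cos_ab, cos_ac, cos_bc = w_a @ w_b, w_a @ w_c, w_b @ w_c
        d2_ab = np.sum((P_a - P_b) ** 2)
        d2_ac = np.sum((P_a - P_c) ** 2)
        d2_bc = np.sum((P_b - P_c) ** 2)

        def residual(legs):
            a, b, c = legs
            return [a*a + b*b - 2*a*b*cos_ab - d2_ab,
                    a*a + c*c - 2*a*c*cos_ac - d2_ac,
                    b*b + c*c - 2*b*c*cos_bc - d2_bc]

        a, b, c = fsolve(residual, guess)
        # Camera-frame points follow as in equation 4.15.
        return a * w_a, b * w_b, c * w_c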
4 and 5 point algorithms exist, [18], [37], but we want an algorithm that can handle an arbitrary number of points, n ≥ 6. This is referred to as the perspective-n-point problem (PnP). The pose ambiguity problem is resolved (under weak assumptions), but the sensitivity to image noise remains. However, as the number of points increases, the sensitivity decreases. The Direct Linear Transform (DLT) algorithm was first introduced by Abdel-Aziz and Karara [2]. It is a non-iterative algorithm originally intended to find a linear transformation from object space to image space and estimate some of the camera’s intrinsic parameters. The linear 44 Figure 4.3: Congruent Triangle Alignment transformation is a combination of the homogeneous transform, K. The transformation of a single point is written as M ui − u0 r11 r12 r13 X i − X0 vi − v0 = ci r21 r22 r23 MYi − Y0 M −d r31 r32 r33 Zi − Z0 r11 MXi − X0 + r12 MYi − Y0 + r13 = ci r21 MXi − X0 + r22 MYi − Y0 + r23 r31 MXi − X0 + r32 MYi − Y0 + r33 C MH, and camera matrix, Zi − Z0 M Zi − Z0 . M Zi − Z0 M (4.18) The scale factor, ci , is isolated from the last line of the equation above, ci = −d r31 ( MX i − X0 ) + r32 ( MY i − Y0 ) + r33 ( MZi − Z0 ) , (4.19) and terms are introduced for focal length and aspect ratio, fu = −d −d , fv = . λu λv 45 (4.20) The variables above are substituted into the original equation 4.18. Now only the top two rows are of interest. Specifically, r11 ( MXi − X0 ) + r12 ( MYi − Y0 ) + r13 ( MZi − Z0 ) − u0 r31 ( MXi − X0 ) + r32 ( MYi − Y0 ) + r33 ( MZi − Z0 ) (4.21) r21 ( MXi − X0 ) + r22 ( MYi − Y0 ) + r23 ( MZi − Z0 ) − v0 . r31 ( MXi − X0 ) + r32 ( MYi − Y0 ) + r33 ( MZi − Z0 ) (4.22) ui = fu vi = fv Finally, the DLT parameters, l1 . . . l12 , are introduced by grouping terms in the equation above. In this form, a model point can be projected to its location in image coordinates given the eleven parameters. ui = l1 MXi + l2 MYi + l3 MZi + l4 l9 MXi + l10 MYi + l11 MZi + 1 (4.23) vi = l5 MXi + l6 MYi + l7 MZi + l8 l9 MXi + l10 MYi + l11 MZi + 1 (4.24) The equations are rearranged once more into a set of linear error equations, ui = l1 MXi + l2 MYi + l3 MZi + l4 − ui (l9 MXi + l10 MYi + l11 MZi ) (4.25) vi = l5 MXi + l6 MYi + l7 MZi + l8 − vi (l9 MXi + l10 MYi + l11 MZi ) , (4.26) which may be written in matrix form as " Mi = M Xi 0 M Yi 0 M Zi 1 0 0 0 M Xi 0 M Yi 0 0 −ui MXi − ui MYi − ui MZi M Zi 1 −vi MXi − vi MYi − vi MZi # . (4.27) ~ equal the vector of transform coefficients, Let L h ~ = l1 . . . L so that l11 " # ~ = ui . Mi L vi 46 iT (4.28) (4.29) A large linear system is constructed from the error equations for the entire point set, M1 h . .. , W = u1 v1 . . . M = Mn un vn iT . (4.30) The projection theorem is used to calculate the least-squares solution of the DLT parameters. ~ = (M T M )−1 M T W L (4.31) Note that a minimum of six points are required to solve for the eleven unknown parameters. If many images are available and used in the same optimization, some of the camera’s intrinsic parameters may be estimated from the DLT parameters. However, for a single image, the intrinsic parameters must be known a priori and the image points should be normalized before constructing the Mi matrices. When calculating pose, the parameters are “clamped” to the correct values, fu − fv = 1 and u0 = v0 = 0. A list of all the DLT parameters and connection to the original transformation are provided in the appendix. 
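To make the parameter estimation concrete, the sketch below stacks the blocks of equation 4.27 for normalized image points and solves equation 4.31 by linear least squares. It is a minimal NumPy sketch; the function name is illustrative, and np.linalg.lstsq is used in place of forming the normal equations explicitly.

    import numpy as np

    def dlt_parameters(model_pts, image_pts):
        """Estimate the eleven DLT parameters l1..l11 (equations 4.25-4.31).
        model_pts: (n, 3) model-frame points; image_pts: (n, 2) normalized
        image points; at least n = 6 correspondences are required."""
        rows, rhs = [], []
        for (X, Y, Z), (u, v) in zip(model_pts, image_pts):
            rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z])
            rows.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z])
            rhs.extend([u, v])
        M = np.asarray(rows, dtype=float)
        W = np.asarray(rhs, dtype=float)
        L, *_ = np.linalg.lstsq(M, W, rcond=None)   # least-squares solution
        return L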
The camera orientation is calculated directly from the DLT parameters via 0 l9 −l1 fu u C MR = D v0 lf9v−l5 u0 l10 −l2 fu v0 l10 −l6 fv l9 l10 where D2 = u0 l11 −l3 fu v0 l11 −l7 fv 1 l92 + 2 l10 2 + l11 , (4.32) l11 . (4.33) It is important to recognize that no orthogonality constraint is placed on the elements of the rotation matrix. With even relatively small amounts of noise, the matrix can be far from orthogonal. This is a big problem! Orthogonalizing a matrix is not a trivial task. The camera position is also calculated from the DLT parameters via −1 X0 l1 l2 l3 −l4 Y0 = l5 l6 l7 −l8 . Z0 l9 l10 l11 −1 (4.34) However, equation 4.18 is not consistent with the standard convention for transformation of 47 a point. Specifically, a point should be rotated and then translated, not the opposite order. Therefore, the correct position requires a rotation by the matrix found above, which might not be orthogonal. X0 C tm = − C (4.35) MR Y0 Z0 4.2.3 “Absolute Orientation” - Horn The problem of aligning a set of vectors expressed in two different coordinate systems has been given the name “Absolute Orientation”. This is extremely useful for many different tasks, such as aligning gravity vectors in the IMU calibration procedure discussed in chapter three. Another example was seen in the last step of the P3P problem, alignment of congruent triangles. Notice that the triangle alignment problem is slightly different because translation is included as well. For many computer vision algorithms absolute orientation is a fundamental step. A set of 3D points are first estimated in the camera frame and then aligned with the same point set expressed in the model. Of course, the task is complicated by noise; the size and shape of the two point clouds might be slightly different. Two popular methods were introduced by Horn et al [19],[20] that produce similar results. They differ in the way that rotation is calculated. Both separate the calculation of rotation and translation into two steps and require calculation of the centroid of both point clouds. The central idea is that the vectors created from the centroid to all of the points in the set provide directional information. In a noiseless system, the centroid or any of the points in the set could be used, but with noise added the centroid has been shown to be the optimal choice. Calculation of both centroids is straightforward and relatively fast. n 1X P̄ = n i=1 M n M Pi , 1X C P̄ = Pi n i=1 C (4.36) However, if absolute orientation is to be performed repeatedly, such as within an iterative algorithm, to save time it is logical to store the value of the centroid in the model frame so that it is only calculated once. Translation may be calculated directly from the rotation and both centroids. C M tM = CP̄ − C (4.37) MR P̄ 48 4.2.3.1 Optimal Rotation - SVD The SVD method is somewhat intuitive. The principle directions of a moment matrix, M= n X C Pi − CP̄ M Pi − M P̄ T , (4.38) i=1 are found through singular value decomposition, {U, S, V } = svd(M ) , (4.39) and then the (orthogonal) rotation is recovered via C MR 4.2.3.2 =UVT . (4.40) Optimal Rotation - Quaternion A problem can occur in the SVD method when a significant level of noise exists. As a result the orthogonal matrix of U V T may have a determinant equal to −1 which indicates it is not special orthogonal (chapter 2). [40] provides a way to detect and fix the problem. The quaternion method always produces a proper rotation matrix, but it is slightly more expensive. 
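A minimal NumPy sketch of the SVD variant (equations 4.36-4.40), including a determinant check of the kind mentioned above, is given below; the function name and the specific reflection fix are illustrative rather than the exact routines of [19], [20], or [40].

    import numpy as np

    def absolute_orientation_svd(cam_pts, model_pts):
        """Align two corresponding 3D point sets; returns (R, t) such that
        cam_pts is approximately R @ model_pts + t (equations 4.36-4.40)."""
        cam_pts = np.asarray(cam_pts, dtype=float)
        model_pts = np.asarray(model_pts, dtype=float)
        cam_bar, model_bar = cam_pts.mean(axis=0), model_pts.mean(axis=0)
        # Moment matrix of equation 4.38.
        M = (cam_pts - cam_bar).T @ (model_pts - model_bar)
        U, S, Vt = np.linalg.svd(M)
        R = U @ Vt
        if np.linalg.det(R) < 0:        # reflection case: force det(R) = +1
            U[:, -1] *= -1.0
            R = U @ Vt
        t = cam_bar - R @ model_bar     # equation 4.37
        return R, t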
A 4-dimensional moment matrix is constructed from multiplication of right and left sided skew symmetric matrices, M= n X i=1 " QL 1 C Pi − CP̄ #∗ ! " QR 1 M Pi − #! M P̄ , (4.41) and then decomposed into a set of eigenvectors and eigenvalues, V = eig(M ) . (4.42) The eigenvector with largest eigenvalue is the quaternion representing the rotation, C MR = R(V∗4 ) . 49 (4.43) 4.2.4 Orthogonal Iteration Algorithm - Lu & Hager The Orthogonal Iteration Algorithm [25] is based on the absolute orientation concept and the projection matrix, F , of equation 4.5. The algorithm is fast and globally convergent (under weak assumptions). Lu and Hager introduced the idea of applying the projection matrix to a computer vision algorithm, but even more novel is the proof of convergence within a derivative-free minimization of the object space residual of equation 4.2. The algorithm is very simple and operates by alternating between optimal translation and optimal rotation calculations. After expanding the object space error function of 4.2, E= n X (Fi − I) C M MR Pi + i=1 n X (Fi − I) CtM , (4.44) i=1 derivation of the optimal rotation is obvious. n X C tM − i=1 n X i=1 n CtM − i=1 n X C n X C i=1 n X Fi tM = ! Fi 1 I− n 1 C tM (R) = n n X tM = ! n CtM = Fi i=1 n X M (Fi − I) C MR Pi M (Fi − I) C MR Pi (4.45) M (Fi − I) C MR Pi i=1 i=1 n 1X I− Fi n i=1 !−1 n X ! M (Fi − I) C MR Pi (4.46) i=1 This leads to an estimate of 3D camera point locations which uses the project matrix, n C Pi (t) = Fi C M MR Pi C + tM 1XC Pi (t) . , P̄ (t) = n i=1 C (4.47) The optimal rotation can be found using either of the two absolute orientation methods above. The SVD method is more intuitive and provided below for clarity. M (t) = n X C C Pi (t) − P̄ (t) i=1 50 M Pi − M P̄ T (4.48) {U, S, V } = svd(M (t)), C MR = UV T (4.49) The authors use the “Global Convergence Theorem” to show that a solution will eventually be reached from an arbitrary starting point. Although global convergence does not guarantee that the true pose will always be recovered, it does suggest that the true pose can be reached from a very broad range of initial guesses. 4.3 Experimental Algorithms Several experimental algorithms have been created as alternatives to the algorithms presented previously. The algorithms have a linear non-iterative stage followed by an nonlinear iterative stage for refinement. Additionally, the algorithms operate by creating a null space for which the pose parameterization lives. For example, the image space error equations 4.1 are rearranged to form a null space for points in the camera, eiu = fu CXi + (u0 − ui ) CZi (4.50) eiv = fv CYi + (v0 − vi ) CZi . (4.51) Similarly, the object space error equations 4.2 are rearranged to form a null space for points in the camera, eix = CXi − f1∗ CPi = (f11 − 1) CXi + f12 CYi + f13 CZi eiy = CYi − f2∗ CPi = f21 CXi + (f22 − 1) CYi + f23 CZi eiz = CZi − f3∗ CPi = f31 CXi + f32 CYi + (f33 − 1) CZi . 4.3.1 (4.52) (4.53) (4.54) Modified DLT + Gauss-Newton The Modified Direct Linear Transform (MDLT) algorithm has been created to overcome some of the short comings of the DLT method presented earlier. Unlike the DLT, the algorithm is 51 designed to find pose only and not estimate any of the camera’s intrinsic parameters. The algorithm consists of two stages: form an initial guess and then iterate using Gauss-Newton. 4.3.1.1 Linear Transformation The initial guess is found by forming a linear system and relaxing the rotation constraints of the pose. 
The unconstrained transform will simply be referred to as a linear transform, L, so that " # " # C M P∗ P∗ =L . (4.55) 1 1 The ordering of the individual coefficients of the linear transform, l1 . . . l12 , is shown through the transformation of a single point, M Xi Xi l1 l2 l3 l4 C Yi l5 l6 l7 l8 MYi = CZ l l MZ . l l i i 9 10 11 12 1 1 0 0 0 1 C (4.56) The equation is simply expanded to show how model points in the camera frame are expressed in terms of the linear transform, C Xi = l1 MXi + l2 MYi + l3 MZi + l4 C Yi = l5 MXi + l6 MYi + l7 MZi + l8 (4.58) Zi = l9 MXi + l10 MYi + l11 MZi + l12 . (4.59) C 4.3.1.2 (4.57) Image Space System The linear transform coefficients are substituted into the null space equations of 4.50 which yields eiu = fu (l1 MXi + l2 MYi + l3 MZi + l4 ) + (u0 − ui )(l9 MXi + l10 MYi + l11 MZi + l12 ) 52 (4.60) eiv = fv (l5 MXi + l6 MYi + l7 MZi + l8 ) + (v0 − vi )(l9 MXi + l10 MYi + l11 MZi + l12 ) . (4.61) Notice that the equations are linear in the coefficients and may therefore be written in matrix form, # " Mai 0 Mbi , (4.62) Mi = 0 Mci Mdi where h i Mai = fu MXi fu MYi fu MZi fu Mbi (4.63) h i M M M = (u0 − ui ) Xi (u0 − ui ) Yi (u0 − ui ) Zi (u0 − ui ) (4.64) h i M M M = f v Xi f v Yi f v Z i f v (4.65) and Mci h i Mdi = (v0 − vi ) MXi (v0 − vi ) MYi (v0 − vi ) MZi (v0 − vi ) . (4.66) The equations for the entire point set are used to construct a large matrix referred to as the ~ equal the vector of linear transform coefficients, “measurement matrix”. Let L h ~ L = l1 . . . so that l12 iT " # M1 i e ~ = u , M = ... . Mi L eiv Mn (4.67) (4.68) The large matrix M creates a null space for the linear transform coefficients so that ~ ∈ R12 | M L ~ = 0} . V = {L 4.3.1.3 (4.69) Object Space System For several reasons it might be desirable to also have a measurement matrix constructed in object space. First, we would like to be able to test the convergence properties of the two error spaces. Maybe the Orthogonal Iteration Algorithm converges well because it minimizes an object space residual. Secondly, it is logical that the initial guess into an iterative algorithm be from a minimization that occurred in the same error space. The linear 53 transform coefficients are substituted into the null space equations of 4.52 which yields eix = (f11 − 1)(l1 MXi + l2 MYi + l3 MZi + l4 ) + f12 (l5 MXi + l6 MYi + l7 MZi + l8 ) (4.70) + f13 (l9 MXi + l10 MYi + l11 MZi + l12 ) eiy = f21 (l1 MXi + l2 MYi + l3 MZi + l4 ) + (f22 − 1)(l5 MXi + l6 MYi + l7 MZi + l8 ) (4.71) + f23 (l9 MXi + l10 MYi + l11 MZi + l12 ) eiz = f31 (l1 MXi + l2 MYi + l3 MZi + l4 ) + f32 (l5 MXi + l6 MYi + l7 MZi + l8 ) (4.72) + (f33 − 1)(l9 MXi + l10 MYi + l11 MZi + l12 ) . Therefore, the object space measurement matrix has 3 n rows whereas the image space matrix has 2 n rows. 4.3.1.4 Null Space Solution Regardless of which error space is used, the null space solution may be computed by eigenvalue decomposition, V = eig(M T M ) . (4.73) The multiplication of M T M can be a rather expensive operation when the number of points is very large. If noise is present, there will not be a strictly zero eigenvalue. Instead, we are only interested in the eigenvector, V∗1 , with the smallest eigenvalue so that L̃ = reshape(V∗1 , 3, 4) . (4.74) It should be obvious that an arbitrary (nonzero) scaling exists because scalar multiplication of any vector within the null space still lies in the null space, i.e. kβ M V∗1 k2 = β k M V∗1 k2 = k M V∗1 k2 = 0. 
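A minimal sketch of this linear stage, assuming known intrinsics and the image-space system of equations 4.62-4.68, is given below; the function name is illustrative, and np.linalg.eigh is used for the decomposition of equation 4.73.

    import numpy as np

    def mdlt_linear_transform(model_pts, image_pts, fu, fv, u0, v0):
        """Build the image-space measurement matrix and return the 3x4
        transform L~ of equation 4.74, determined only up to scale."""
        rows = []
        for (X, Y, Z), (u, v) in zip(model_pts, image_pts):
            p = [X, Y, Z, 1.0]
            rows.append([fu * c for c in p] + [0.0] * 4 + [(u0 - u) * c for c in p])
            rows.append([0.0] * 4 + [fv * c for c in p] + [(v0 - v) * c for c in p])
        M = np.asarray(rows)
        evals, evecs = np.linalg.eigh(M.T @ M)      # equation 4.73
        L_vec = evecs[:, np.argmin(evals)]          # smallest eigenvalue
        return L_vec.reshape(3, 4)                  # equation 4.74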
However, in this context, the scale factor, β is not arbitrary and must be determined for the algorithm to work properly. Specifically, the scale is needed to compute an estimate of the model points in the camera frame within P∗ = β L̃ MP∗ = β L̃ MP∗ = β CS∗ , C 54 (4.75) because both point clouds should be approximately the same shape and size. Fortunately, the scale may be recovered easily, in a least-squares sense. Let CS∗ equal the set of improperly scaled model points in the camera frame. The centroid of each set is computed by n 1XC S̄ = Si , n i=1 C n 1XM P̄ = Pi , n i=1 M (4.76) and then the sum of squared distance about the centroid is calculated via n X C Si − CS̄ 2 , d= C n X M Pi − d= M 2 P̄ . M (4.77) i=1 i=1 The scale factor is the square root of the distance ratio, r C β = sgn( Z̄) Md Cd , (4.78) where the sgn( CZ̄) is needed to correct for camera points that might lie behind the camera, in the negative Z direction instead of positive. After the scale factor is recovered, the absolute orientation concept is applied to find the relative transformation between the point sets, " # P∗ = 1 " C C MH M P∗ 1 # . (4.79) Note, in this way the resulting rotation is forced to be orthogonal unlike in the standard DLT method. 4.3.1.5 Final Optimization The pose solution above provides a good initial guess that was found non-iteratively in a linear fashion. The initial guess could be used in any iterative algorithm but it makes sense to use it within the framework that has already been developed. Recall that the moment matrix, M , will be of size 2 n x 12 or 3 n x 12. The matrix can actually be decomposed into a reduced size of 12 x 12. From SVD we obtain M T M = (V SU T )(U SV T ) = (V S)(SV )T = M̂ T M̂ . 55 (4.80) This is a very significant reduction if the number of points is large. QR decomposition is perhaps a quicker way to compute the reduced measurement matrix, M̂ . It can be shown, [14], that minimizing a residual with the reduced measurement matrix is the same as minimizing a residual with the full size matrix, ~ = min kM Lk ~ . min kM̂ Lk ~ L (4.81) ~ L Sub-optimality of the non-iterative linear solution might be attributed in part to the relaxation of the orthogonality constraint. Specifically, the linear transform is allowed to have 12-DOF. Fortunately, it is easy to restrict the degrees of freedom in an iterative algorithm by parameterizing the pose (linear transform above) as covered in chapter two, R(Θ) t(Θ) L(Θ) = 0 1 . (4.82) For example, Euler angles and three translation elements might be used so that h i Θ = α β γ tx ty tz . (4.83) Now the transform parameters, l1 . . . l12 , only have 6-DOF, h ~ L(Θ) = l1 . . . l12 iT . (4.84) A standard Gauss-Newton algorithm may be used and will have good convergence properties if the initial guess is in the neighborhood of the global minima. ~ f (Θ) = M̂ L(Θ) Jf (Θ) = ∂f1 (Θ) ∂Θ1 .. . ∂f12 (Θ) ∂Θ1 (4.85) ... .. . ∂f1 (Θ) ∂Θ6 ... ∂f12 (Θ) ∂Θ6 Θk+1 = Θk − Jf (Θk )T Jf (Θk ) .. . −1 Jf (Θk )T f (Θk ) (4.86) (4.87) Note, by using the reduced measurement matrix, the Jacobian computed each iteration will always be of size 12 x 6 no matter how many points are in the full set. A significant time reduction occurs if many points and iterations are used. 56 4.3.2 3-Simplex + Gauss-Newton The idea of using a 3-simplex in a computer vision algorithm was first introduced by MorenoNoguer et al [27] in 2007. Many people are not familiar with the simplex and the connection to a computer vision algorithm was novel. 
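Before moving on to the simplex formulation, the final optimization of section 4.3.1.5 can be illustrated with a short Gauss-Newton loop. The sketch below is not the implementation used in this work: it assumes a Z-Y-X Euler angle convention and approximates the Jacobian of equation 4.86 by finite differences instead of analytic derivatives; all names are illustrative.

    import numpy as np

    def euler_to_R(a, b, g):
        """Z-Y-X Euler angles to a rotation matrix (one of several conventions)."""
        ca, sa, cb, sb = np.cos(a), np.sin(a), np.cos(b), np.sin(b)
        cg, sg = np.cos(g), np.sin(g)
        Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])
        Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
        Rx = np.array([[1, 0, 0], [0, cg, -sg], [0, sg, cg]])
        return Rz @ Ry @ Rx

    def gauss_newton_pose(M_hat, theta0, iters=10, eps=1e-6):
        """Minimize ||M_hat @ L(theta)|| over theta = (alpha, beta, gamma,
        tx, ty, tz), following equations 4.85-4.87."""
        def L_vec(theta):
            L = np.hstack([euler_to_R(*theta[:3]), theta[3:].reshape(3, 1)])
            return L.reshape(-1)                 # l1..l12 ordering of eq. 4.56
        theta = np.asarray(theta0, dtype=float)
        for _ in range(iters):
            f = M_hat @ L_vec(theta)
            J = np.zeros((len(f), 6))
            for j in range(6):                   # finite-difference Jacobian
                d = np.zeros(6); d[j] = eps
                J[:, j] = (M_hat @ L_vec(theta + d) - f) / eps
            theta = theta - np.linalg.solve(J.T @ J, J.T @ f)
        return theta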
A quick overview of the 3-simplex is covered to provide a foundation for the algorithm to follow. 4.3.2.1 3-Simplex and Barycentric Coordinates The previous notation for model points is modified slightly, M Pmi = h M i Xm M Ymi M i Zm iT , Pm∗ = M h M Pm1 M Pmn ... i , (4.88) by attaching the “m” subscript to distinguish it from vertices of the 3-simplex, “s”, M Psi = h M Xsi M Ysi M Zsi iT , Ps∗ = M h M Ps1 M Ps2 M Ps3 i Ps4 . M (4.89) Barycentric coordinates describe the location of points with respect to the vertices of the simplex, # # " " M ∗ M ∗ Ps Pm α. (4.90) = 1 1 They may be quickly computed in a single matrix operation, " α= Ps∗ 1 M #−1 " Pm∗ 1 M # . (4.91) Vertices that are coplanar create a singular matrix and should therefore not be used. When the average of the barycentric coordinates is applied to the simplex vertices, the centroid (or barycenter) of the point set is found. The utility of the simplex in the context of a computer vision algorithm is that the relationship between vertices, points, and barycentric coordinates is rotation and translation invariant, as the following proof suggests. " # Pm∗ = 1 " C C MH # Pm∗ = 1 " M C MH 57 # " # C ∗ Ps∗ Ps α= α 1 1 M (4.92) This expression shows the relationship of the barycentric coordinates and simplex vertices to a single point i C 1 Xm Xs C i C 1 Ym Ys CZ i = CZ 1 m s 1 1 C C Xs2 C 2 Ys C 2 Zs 1 C Xs3 C 3 Ys C 3 Zs 1 Xs4 αi1 C 4 Ys αi2 . C 4 Zs αi3 1 αi4 C (4.93) The expression is simply expanded and regrouped into sums so that C i Xm = CXs1 + αi2 CXs2 + αi3 CXs3 + αi4 CXs4 = 4 X (4.94) αij CXsj j=1 C Ymi = CYs1 + αi2 CYs2 + αi3 CYs3 + αi4 CYs4 = 4 X (4.95) αij CYsj j=1 C i Zm = CZs1 + αi2 CZs2 + αi3 CZs3 + αi4 CZs4 = 4 X . αij CZsj (4.96) j=1 Therefore, what may have seemed like a redundancy before is actually a constraint, i.e. the sum of each barycentric coordinate set must equal +1. αi1 + αi2 + αi3 + αi4 = 4 X αij = 1 (4.97) j=1 In the MDLT method the coefficients of a linear transform are estimated, but in this algorithm, the vertices of the simplex in the camera frame are estimated. The equations above show how the model points may be expressed in terms of the simplex vertices and barycentric coordinates. 58 4.3.2.2 Measurement Matrices The projection null space equations 4.50 and 4.52 are used again so that a linear system is created to minimize a residual in either image or object space. Like in the MDLT method, a measurement matrix, M , is constructed from smaller Mi matrices. Let CP~s equal the vector of simplex vertices in the camera, C~ Ps = h C Xs1 ... C 4 Zs iT . (4.98) This time the vector of simplex vertices should lie in the null space, V = { CP~s ∈ R12 | M CP~s = 0} . (4.99) The simplex vertices and barycentric coordinates are substituted into the null space equations of 4.50 which yields eiu = fu 4 X αij CXsj + (u0 − ui ) j=1 = 4 X 4 X αij CZsj j=1 (4.100) fu αij CXsj + (u0 − ui )αij CZsj j=1 eiv = fv 4 X αij C Ysj + (v0 − vi ) j=1 = 4 X 4 X αij CZsj j=1 fv αij C Ysj + (v0 − vi )αij CZsj (4.101) . j=1 The equations are linear in the vertices and may be written in the form of a measurement matrix " # fu αi1 0 (u0 − ui )αi1 . . . fu αi4 0 (u0 − ui )αi4 Mi = (4.102) 0 fv αi1 (v0 − vi )αi1 . . . 0 fv αi4 (v0 − vi )αi4 so that " # ei Mi CP~s = ui . 
ev 59 (4.103) The simplex vertices and barycentric coordinates are substituted into the null space equations of 4.52 which yields eix = (f11 − 1) 4 X αij C Xsj + f12 4 X j=1 = 4 X αij C Ysj + f13 j=1 4 X αij CZsj j=1 (4.104) (f11 − 1) αij CXsj + f12 αij CYsj + f13 αij CZsj j=1 eiy = f21 4 X αij C Xsj + (f22 − 1) 4 X j=1 = 4 X αij C Ysj + f23 j=1 4 X αij CZsj j=1 (4.105) f21 αij CXsj + (f22 − 1) αij CYsj + f23 αij CZsj j=1 eiz = f31 4 X αij C Xsj j=1 = 4 X + f32 4 X αij C Ysj + (f33 − 1) j=1 4 X αij CZsj j=1 (4.106) f31 αij CXsj + f32 αij CYsj + (f33 − 1) αij CZsj . j=1 Therefore, the object space measurement matrix has 3 n rows whereas the image space matrix has 2 n rows. 4.3.2.3 Null Space Solution The projection null space is found in the same way as in the MDLT algorithm, V = eig(M̂ T M̂ ), and the single eigenvector result is very similar P̃s∗ = reshape(V∗1 , 3, 4) . C (4.107) Again, an unknown scale factor, β, exists within Pm∗ = β CP̃s∗ α = β C ∗ P̃s∗ α = β CSm . C (4.108) The scale factor is recoved like in the MDLT, and an estimate of model points in the camera frame is produced. Finally, the relative transformation is computed using the absolute orientation method. 60 4.3.2.4 Final Optimization The linear non-iterative solution for the simplex vertices suffers from the same problem as the MDLT; the simplex vertices are given 12-DOF even though they really only have six. For example, distances between all vertex pairs should be the same in the model and camera frames. Six constraints are placed on twelve parameters leaving only 6-DOF. Instead of using an arbitrary simplex it is logical to choose a specific shape, such as a unit length right-handed orthogonal triad. One possible method of constructing such a triad is by decomposing the model point set. The triad origin is placed at the centroid, n 1XM i P̄m = Pm , n i=1 M (4.109) and the legs are made to point in the principle directions via {U, S, V } = svd M SR h M PmT − 1 ... h i h = V1 V2 ±V3 = MX̂s iT 1 M P̄mT M M Ŷs Ẑs (4.110) i (4.111) such that det(R) = +1. If chosen like this another reference frame, {S}, has been effectively created, M M P̄m M SR . (4.112) SH = 0 1 More analysis is needed to determine if this is the optimal selection of the simplex. Four points are required to define an orthogonal triad which can move freely in space. One point defines the position and the other three define the directions along the legs. The following expression shows how the rotation and translation of the homogeneous transform above can be used to construct an orthogonal triad. Ps∗ M = h M P̄m M P̄m + M X̂s M P̄m + M Ŷs M P̄m + M Ẑs i (4.113) Therefore, the orthogonal triad has the following properties when expressed in any reference frame. Ps2 − Ps1 · Ps3 − Ps1 = Ps2 − Ps1 · Ps4 − Ps1 = Ps3 − Ps1 · Ps4 − Ps1 = 0 61 (4.114) 2 Ps − Ps1 = Ps3 − Ps1 = Ps4 − Ps1 = 1 (4.115) Euler angles may be used to parameterize the rotational components of the traid. R(α, β, γ) = C SR (4.116) h iT Ps1 = tx ty tz (4.117) h iT Θ = α β γ tx ty tz (4.118) C The Gauss-Newton algorithm for the 3-Simplex is identical in form to the MDLT. Of course, the measurement matrix and parameterization are different but the number of degrees of freedom is the same. f (Θ) = M̂ CP~s (Θ) (4.119) If a unit length orthogonal triad is used than the final pose solution may be quickly computed. If it is not, absolute orientation will likely need to be performed. 
C MH 4.3.3 = C S SH MH = M SR 1 0 C 1 C Ps SR 0 M Ps1 1 −1 (4.120) 3-Simplex SDP The linear (non-iterative) solutions of the MDLT and 3-simplex typically provide an initial guess which is good enough for an iterative algorithm to converge properly to the global minima. However, as noise levels increase this might not be a valid assumption. The algorithm presented here is a technique to produce a more accurate result from the same framework. However, it is iterative and inevitably slower than the linear solutions. The algorithm uses the 3-simplex, multiple eigenvectors, and a polynomial semidefinite program (SDP). The linear transform of the MDLT is not applicable because of the way the degrees of freedom are spread throughout the twelve parameters. Specifically, six constraints are placed on the nine rotational parameters while the three translational parameters are left somewhat disconnected or unconstrained. The 3-simplex is different because six constraints are placed over all twelve parameters together so that rotation and translation are tightly coupled. Recall from earlier that only the eigenvector from V = eig(M̂ T M̂ ) with the smallest 62 eigenvalue was of interest. Only in a noiseless system will there be a perfectly zero eigenvalue. However, with noise, all of the eigenvectors provide “some” information in decreasing order of significance as the size of the eigenvalue increases. This algorithm attempts to use the first two, three, or four eigenvectors, va = reshape(V∗1 , 3, 4) , vb = reshape(V∗2 , 3, 4) (4.121) vc = reshape(V∗3 , 3, 4) , vd = reshape(V∗4 , 3, 4) . in a linear combination. Up to four scale factors, βa , βb , βc , and βd , are used within the vector combinations P̃s∗ = βa va + βb vb C P̃s∗ = βa va + βb vb + βc vc C (4.122) P̃s∗ = βa va + βb vb + βc vc + βd vd . C The betas are calculated with non-linear shape constraints that could not be imposed during the eigenvalue decomposition. A linear combination of vectors within the null space produces another vector which also lies in the null space. i.e. 0 = M (βa va ) = M (βb vb ) = M (βc vc ) = M (βd vd ) = M (βa va + βb vb + βc vc + βd vd ). An estimated set of model points in camera frame can be computed for each combination, Pm∗ = CP̃s α , C (4.123) and then the absolute orientation method is applied for the relative transformation. 4.3.3.1 Distance Constraint The polynomial SDP relies on (quadratic) constraints that are rotation and translation invariant. Euclidean distance between vertex pairs is a logical choice, and a set of four 4! = 6 possible combinations. In a noiseless system, distances in the model vectices has (4−2)! 2! and camera frames should be equal, C i C j 2 M i Ps − Ps = P s − 2 Psj . M (4.124) However, with noise present, an algorithm can be used to minimize any error by comparing distances in both frames for all l = 1 . . . 6 vertex combinations. The two variable/vector 63 case is provided as an example. 2 dl = (βa vai + βb vbi ) − (βa vaj + βb vbj ) 2 = βa (Xai − Xaj ) + βb (Xbi − Xbj ) 2 + βa (Yai − Yaj ) + βb (Ybi − Ybj ) 2 + βa (Zai − Zaj ) + βb (Zbi − Zbj ) C dl = MPsi − M 4.3.3.2 (4.125) 2 Psj M (4.126) ”Linearization” + Gauss-Newton In Moreno-Noguer et al [27], a two step procedure is suggested for finding the correct scale factors, betas, of the eigenvector combination. In the first step, the six distance constraints are used in a linear system where all combinations of betas are decoupled and treated as independent variables. 
C β = Md (4.127) −1 T M C d (4.128) β = CT C This is problematic and becomes ill-conditioned for more than three variables. The two variable/vector case is provided here as an example. C l l l dl = C20 βa2 + C02 βb2 + C11 βa βb 2 1 1 1 C11 C02 C20 βa . . . . .. .. , β = βb2 C= , . 6 6 6 βa βb C20 C02 C11 (4.129) d1 . M . d= . M d6 M (4.130) The second step is a Gauss-Newton algorithm that uses the results of the first step as an initial guess. Because of the way the linearization operates, it will often produce an initial guess that is very far from the global minima. Hence, the Gauss-Newton algorithm is likely to diverge. 2 fl (β) = Cdl − Mdl (4.131) −1 β k+1 = β k − JF (β k )T JF (β k ) JF (β k )T F (β k ) (4.132) 64 Again, the two variable/vector case is provided as an example. l l l βa βb − βb2 + C11 βa2 + C02 fl (βa , βb ) = C20 M dl ∂f1 (βa ,βb ) " # f1 (βa , βb ) ∂β. a βa .. .. β= , F (β) = . , JF (β) = βb ∂f6 (βa ,βb ) f6 (βa , βb ) ∂βa 4.3.3.3 2 (4.133) ∂f1 (βa ,βb ) ∂βb .. . ∂f6 (βa ,βb ) ∂βb (4.134) Dot Product Constraint The dot product is another rotation and translation invariant quadratic constraint. Unlike distance, the dot product constraint requires a vertex triplet, and a set of four vectices has 4! = 12 possible combinations. In a noiseless system, the direction cosine that the dot (4−3)! 2! product represents should be equal in the camera and model frames, C Psi − CPsk · C Psj − CPsk = M Psi − M Psk · M Psj − M Psk . (4.135) Dot products are computed for all l = 1 . . . 12 sets. The two variable case is provided as an example. dl = (βa vai + βb vbi ) − (βa vak + βb vbk ) · (βa vaj + βb vbj ) − (βa vak + βb vbk ) = βa (Xai − Xak ) + βb (Xbi − Xbk ) βa (Xaj − Xak ) + βb (Xbj − Xbk ) + βa (Yai − Yak ) + βb (Ybi − Ybk ) βa (Yaj − Yak ) + βb (Ybj − Ybk ) + βa (Zai − Zak ) + βb (Zbi − Zbk ) βa (Zaj − Zak ) + βb (Zbj − Zbk ) C M dl = 4.3.3.4 M Psi − M Psk · M Psj − M Psk (4.136) (4.137) Orthogonal Triad Constraints If the simplex is an orthogonal triad (previous section), a specific set of constraints may be used to define its shape. The set of six constraints includes three distances, C Ps2 − C Ps1 · C d2 = C Ps2 − C Ps1 · C d3 = C Ps3 − C Ps1 · d1 = C 65 C Ps3 − C Ps1 C Ps4 − C Ps1 C 1 Ps4 − C Ps (4.138) M d1 = M d2 = M d3 = 0 , (4.139) and three dot products 2 d4 = C Ps2 − C Ps1 2 C d5 = C Ps3 − C Ps1 2 C d6 = C Ps4 − C Ps1 C M d4 = 4.3.3.5 M d5 = M d6 = 1 . (4.140) (4.141) Polynomial Semidefinite Program An important observation is made regarding the quadratic constraints with respect to the betas. Specifically, the sum of squared distance or dot product error in the camera and model frames is in fact a multivariate polynomial in the betas. E= 6 X C dl (β) − M dl 2 (4.142) l=1 Lasserre [23] shows how minimizing a polynomial that is a sum-of-squares can be transformed into a convex linear matrix inequality (LMI) problem and then efficiently solved in a semidefinite program (SDP). Hence, the name that was given to the algorithm, “Simplex SDP”. A polynomial is a finite linear combination of monomials with real-valued coefficients. If all of the monomials are of the same degree, the polynomial is called homogeneous. A multivariate polynomial is a sum-of-squares if it can be written as a sum of squares of other polynomials. In general, sum-of-squares decompositions are not unique. A deceptively simple sufficient condition for a real-valued polynomial to be globally nonnegative (convex) is the existence of a sum of squares decomposition. 
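Whichever way the betas are obtained, the residual of equation 4.142 can be evaluated directly; the short sketch below does so for the six distance constraints and is useful for verifying a candidate solution. It is a minimal sketch with illustrative names.

    import numpy as np
    from itertools import combinations

    def distance_residual(betas, eigvecs, model_vertices):
        """Residual of equation 4.142 for the six vertex-pair distance
        constraints. eigvecs: list of 3x4 reshaped eigenvectors (v_a, v_b, ...);
        model_vertices: 3x4 simplex vertices in the model frame."""
        # Candidate camera-frame vertices as a linear combination of eigenvectors.
        cam_vertices = sum(b * v for b, v in zip(betas, eigvecs))
        E = 0.0
        for i, j in combinations(range(4), 2):    # the six pairs of vertices
            d_cam = np.sum((cam_vertices[:, i] - cam_vertices[:, j]) ** 2)
            d_mod = np.sum((model_vertices[:, i] - model_vertices[:, j]) ** 2)
            E += (d_cam - d_mod) ** 2
        return E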
SDP is a broad generalization of linear programming (LP), to the case of symmetric matrices. A linear objective function is optimized over linear equality constraints with a semidefinite matrix condition. The LMI minimization problem is written in standard form as min E = P T X , such that M ≥ 0 (4.143) X where E is the objective function and M is the semidefinite matrix of the linear inequality 66 constraint. The key feature of SDPs is their convexity, since the feasible set defined by the constraints is convex. In this context, the SDP is used to solve convex LMI relaxations of multivariate real-valued polynomials. Specifically, the quadratic shape constraints on the eigenvector combinations are decomposed into monomials of the betas when written as a sum of squares. The Gauss-Newton algorithm of Moreno-Noguer et al [27] does not have guaranteed convergence, but the SDP framework presented here does. As shown earlier, the distance and dot product constraints in the camera may be written as a polynomial in the betas with coefficients C. The two variable case is provided as an example. l l l βa βb − βb2 + C11 βa2 + C02 El = C20 M dl 2 l l l l = P20 βa2 + P31 βa3 βb + P40 βa4 + P02 βb2 (4.144) l l l l + P13 βa βb3 + P04 βb4 + P11 βa βb + P22 βa2 βb2 When the error is squared the coefficients are transformed into a higher dimensional space. Effectively, the quadratics becomes a quartic. The following expression outlines the mapping to squared error, C → P . l l M P20 = −2 C20 dl l l l P31 = 2 C11 C20 l l 2 P40 = (C20 ) l l M P02 = −2 C02 dl (4.145) l l l P13 = 2 C02 C11 l l 2 P04 = (C02 ) l l M P11 = −2 C11 dl l l 2 l l P22 = (C11 ) + 2 C02 C20 Now the objective function may be written in linear form, E = P T X, where h iT X = βa2 βa3 βb βa4 βb2 βa βb3 βb4 βa βb βa2 βb2 h P = P20 P31 P40 P02 P13 P04 P11 P22 iT . (4.146) (4.147) In this algorithm, semidefinite matrix M is actually constructed from three smaller 67 matrices that are semidefinite themselves, M0 0 0 M = 0 M1 0 . 0 0 M2 (4.148) M0 is a vector weighting constraint used to force the magnitude of the first eigenvector, βa , to be larger than all others. M1 and M2 are both “moment” matrices that are constructed from M1 = X1 X1T (4.149) M2 = X2 X2T . (4.150) The LMI constraints of the two variable case are provided as an example. h iT X1 = βa βb iT h X2 = 1 βa2 βb2 βa βb h i M0 = βa2 − βb2 " # βa2 βa βb M1 = βa βb βb2 1 βa2 2 βa βa4 M2 = β 2 β 2β 2 b a b βa βb βa3 βb βb2 βa2 βb2 βb4 βa βb3 βa βb βa3 βb βa βb3 2 2 βa βb (4.151) (4.152) After the SDP terminates, all of the variables within vector X have been estimated. The beta variables in the original problem formulation can easily be recoved. Because the sum-of-squares decomposition is not unique, a (trivial) sign ambiguity exists that places the simplex in front or behind the camera. The two variable case is shown as an example below. p p βa = ± βa2 = ± X1 βa βb X7 βb = = βa βa 68 (4.153) During the discussion of the algorithm, only examples of the two variable case where provided. However, it is more desirable to use additional eigenvectors within the SDP. Obviously, there is a performance/complexity trade-off, and it is not recommended to use more than four eigenvectors. The appendix contains additional information that specifically outlines the three and four variable cases. Note that we found the best results are produced with four eigenvectors but presently have not been able to prove why it happens. 
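For reference, the back-substitution of equation 4.153 for the two-variable case can be written as the following small sketch; the names and the scalar indexing of X are illustrative, and the sign ambiguity is resolved by checking that the reconstructed simplex lies in front of the camera, in the spirit of equation 4.78.

    import numpy as np

    def recover_betas_2var(X, v_a, v_b):
        """Recover (beta_a, beta_b) from the monomial vector of equation 4.146,
        X = [ba^2, ba^3 bb, ba^4, bb^2, ba bb^3, bb^4, ba bb, ba^2 bb^2],
        and rebuild the camera-frame simplex vertices."""
        beta_a = np.sqrt(X[0])              # +/- ambiguity (equation 4.153)
        beta_b = X[6] / beta_a              # X7 = beta_a * beta_b
        vertices = beta_a * v_a + beta_b * v_b
        if vertices[2, :].mean() < 0:       # simplex behind the camera: flip sign
            beta_a, beta_b, vertices = -beta_a, -beta_b, -vertices
        return beta_a, beta_b, vertices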
69 Chapter 5 Recursive Estimation The previous chapters describe the two measurement sources (or inputs) to the pose estimation system; specifically, how raw data is transformed into a usable measurement. Regardless of its source, each measurement may be thought of as a noisy observation of the same object at some point in time. A method is needed to fuse the measurements together and account for their uncertainty. This chapter covers recursive state estimation for the sole purpose of pose estimation. [35] is an excellent resource for recursive estimation techniques. Figure 5.1: Markov Process 70 5.1 Recursive Estimation of Markov Processes A Markov process, figure 5.1, is a stochastic process in which the probability distribution of the current state is conditionally independent of the path of past states. A Markov model is a statistical model in which the system being modeled is assumed to be a Markov process. In a regular Markov model, the state is directly observable. In a hidden Markov model, the state is not directly observable, but variables influenced by the state are observable. Because of the Markov assumption, the true state is conditionally independent of all earlier states given the immediately previous state. p(xk |x0 , . . . , xk−1 ) = p(xk |xk−1 ) (5.1) Similarly the measurement at the kth time step is dependent only upon the current state and is conditionally independent of all other states given the current state. p(zk |x0 , . . . , xk ) = p(zk |xk ) (5.2) Using these assumptions the probability distribution over all states and measurements of the HMM can be written simply as: p(x0 , . . . , xk , z1 , . . . , zk ) = p(x0 ) k Y p(zi |xi )p(xi |xi−1 ) (5.3) i=1 Recursive Bayesian estimation is a general probabilistic approach for estimating an unknown probability density function recursively over time using incoming measurements and a mathematical process model. Because all uncertainty is our system is assumed to be Gaussian, we are concerned with a family of recursive estimators, collectively called Gaussian filters. Historically, Gaussian filters constitute the earliest tractable implementations of Bayes filter for continues spaces. They are also by far the most popular family of techniques to date - despite a number of shortcomings. Gaussian techniques all share the basic idea that probability distributions are represented by multivariate normal distributions. In the moments parameterization, the density is parameterized by a mean and covariance, the first and second moment respectively; all higher moments are completely specified by the first two moments. 71 5.2 Gaussian & Nonparametric Filters The most studied technique for implementing Bayes filters is the Kalman Filter which was introduced by Swerling (1958) and Kalman (1960). It was invented as a technique for filtering and prediction of linear Gaussian systems by implementing belief computation for continuous states. Beliefs (conditional probability) are represented by the moments parameterization. Posteriors are Gaussian if the following three properties apply, in addition to the Markov assumptions of the Bayes filter. The state update conditional probability, p(xk |uk , xk−1 ), must be Gaussian. One way of specifying this is through the linear discrete-time dynamic equation xk = Fk xk−1 + Gk uk + wk , (5.4) wk ∼ N (0, Qk ) . (5.5) where The measurement conditional probability, p(zk |xk ), must also be Gaussian. This can be specified by the linear algebraic equation zk = Hk xk + vk , (5.6) vk ∼ N (0, Rk ) . 
(5.7) where The initial belief must also be normally distributed p(x0 ) = N (0, Σ0 ) . (5.8) If the properties above are true, the Kalman filter calculates Σk|k−1 = Fk Σk−1 FkT + Qk (5.9) as a prediction of state uncertainty where Q is the process noise covariance. The state and uncertainty predictions are referred to together as the “time update” equations. The state and uncertainty are corrected when a measurement is received. First, the 72 Kalman gain or blending factor, Kk = Σk|k−1 HkT (Hk Σk|k−1 HkT + Rk )−1 , (5.10) is calculated that minimizes the a posteriori error covariance. Next, the measurement is incorporated to generate an a posteriori state estimate via xk|k = xk|k−1 + Kk (zk − Hk xk|k−1 ) . (5.11) Finally, an a posteriori error covariance is obtained with the Kalman gain, Σk|k = (I − Kk Hk )Σk|k−1 . (5.12) These three equations are collectively referred to as the “measurement update” equations. Unfortunately, state transitions and measurements are rarely linear in practice. The Extended Kalman Filter (EKF) relaxes the linearity assumption at the expense of accuracy. Instead, state transition and measurement probabilities are governed by (differentiable) nonlinear functions, xk = f (xk−1 , uk ) + wk (5.13) zk = h(xk ) + vk . This violates assumptions made during the derivation of the Kalman filter. A Gaussian probability passed through a linear function remains Gaussian but is not necessarily Gaussian if passed through a nonlinear function. The EKF inherits from the Kalman Filter the basic belief representation, but it differs in that this belief is only approximate. If the initial state estimate is wrong or if the process is modeled incorrectly, the filter may quickly diverge. The central idea of the EKF is linearization, approximation of the nonlinear functions about the current state. However, the estimated covariance matrix tends to underestimate the true covariance. After linearization the belief propagation is equivalent to the Kalman Filter. The uncertainty prediction equation of the EKF is the same as within the Kalman Filter except the matrix, Fk , is the Jacobian of the nonlinear update function, i.e. Σk|k−1 = Fk Σk−1 FkT + Qk . 73 (5.14) Similarly, the Kalman gain is computed with the Jacobian Hk of the the nonlinear measurement function via Kk = Σk|k−1 HkT (Hk Σk|k−1 HkT + Rk )−1 . (5.15) As expected, the a posteriori state estimate is calculated with the nonlinear function, xk|k = xk|k−1 + Kk (zk − hk (xk|k−1 )) . (5.16) Finally, an a posteriori error covariance is obtained with the Jacobian and Kalman gain, Σk|k = (I − Kk Hk )Σk|k−1 . (5.17) The EKF uses a first order Taylor series to approximate the nonlinear functions. Matrices of partial derivatives, Jacobians, are computed as each iteration. The goodness of the linear approximation applied by the EKF depends on two main factors: The degree of uncertainty and the degree of local nonlinearity of the functions that are being approximated. In practice, it is important to keep the uncertainty small. However, the EKF can give reasonable performance. The Taylor series expansion applied by the EKF is only one way to linearize the transformation of a Gaussian. The Unscented Kalman Filter (UKF) uses a different approach and has been found to often yield superior results, [22]. It performs a stochastic linearization through the use of a weighted statistical linear regression process. The UKF deterministically extracts so-called sigma points from the Gaussian and passes them through the nonlinear functions. 
These points are located at the mean and along the main axes of covariance, two per dimension. For an n-dimensional Gaussian, 2n+1 sigma points are chosen. Recall that in the EKF only the mean is considered. After the sigma points are passed through the nonlinear function the mean and covariance are recovered by a weighted sum. For nonlinear systems the UKF produces equal or better results than the EKF where large improvements occur as points of high nonlinearity and uncertainty. The filter more accurately captures the true mean and covariance. Another advantage of the UFK is that it does not require the computation of Jacobians, which are difficult to determine in some domains. The UKF is thus often referred to as a derivative-free filter. The particle filter, also known as Sequential Monte Carlo (SMC) method, is an alternative nonparametric implementation of the Bayes filter. They were first introduced in the early 50s by physicists and have become very popular recently in statistic and related 74 fields. Nonparametric filters do not rely on a fixed functional form of the posterior, such as Gaussians. The state space is approximated by a set of random samples drawn from the posterior distribution. The quality of the approximation depends on the number of samples used to represent the posterior. As the number of samples goes to infinity, nonparametric techniques tend to converge uniformly to the correct posterior, under weak assumptions. Such a representation is approximate, but it is nonparametric, and therefore can represent a much broader space of distributions. Another advantage of sample-based representation is the ability to model nonlinear transformations of random variables. The representational power of these techniques comes at the price of added computational complexity. Each particle is an instantiation of the current state. The denser a sub region of state space is populated by samples, the more likely is that the true state falls into this region. Like the UKF, particles are assigned weights, but the samples are not recombined into an estimate of mean and covariance. Instead, some of the particles are kept and others are replaced by resampling. The resampling step has the important function of forcing particles back into regions of the state space with high posterior probability. 5.3 Selection of Quaternion EKF When the pose estimation literature is surveyed, quaternions are found to be a favorite parameterization choice in every domain, from aerospace to AR, because of their nonsingular mapping. Consequently, the rigid body motion model using quaternions (chapter two) is adopted. Understanding system dynamics is an initial step in the design of a recursive estimator. From the previous section, we are faced with three different estimators, EKF, UKF and particle filter. It will be shown below that the state transition and measurement functions are not highly nonlinear, e.g. no trigonometric functions. Therefore, the Taylor series linearization method of the EKF should perform fairly well, and the added complexity of the UKF is unwarranted. This hypothesis is supported by the work of [21] that showed no significant benefit of the UKF over the EKF for quaternions. Also, it is assumed that all probability distributions are Gaussian. The assumption is approximately true for the sensor noise which can be verified through inspection of actual measurement data. The process noise has yet to be carefully studied and might happen to be non-Gaussian. 
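The Gaussian assumption on the sensor noise can be checked directly from a short log of stationary sensor readings. The following is a minimal sketch only, not part of the thesis code base: the log file name, its layout, and the function name are hypothetical, and any constant bias is assumed to have been removed before the check.

import numpy as np
from scipy import stats

def inspect_noise(samples):
    """Quick Gaussianity check on logged sensor noise (e.g. one axis of a stationary gyro)."""
    samples = np.asarray(samples, dtype=float)
    summary = {
        "mean": samples.mean(),
        "std": samples.std(ddof=1),
        "skewness": stats.skew(samples),             # approximately 0 for a Gaussian
        "excess_kurtosis": stats.kurtosis(samples),  # approximately 0 for a Gaussian
    }
    # D'Agostino-Pearson normality test; a small p-value argues against the Gaussian assumption.
    _, summary["normality_p_value"] = stats.normaltest(samples)
    return summary

# Usage with a hypothetical log of stationary gyro readings (rad/s, one row per sample):
# gyro_x = np.loadtxt("gyro_stationary_log.txt")[:, 0]
# print(inspect_noise(gyro_x - gyro_x.mean()))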
The work of [1] suggests that a particle filter works better than an EKF for 75 head motion within AR. However, the slightly improved estimation accuracy did not seem to out weigh the additional complexity. Additionally, we have a method of computing camera uncertainty that cannot be easily integrated into the particle filter. The uncertainty calculation is covered in the last section of this chapter. For the reasons stated above, a quaternion-based EKF was chosen for use within our augmented reality system. 5.4 Time Update Figure 5.2: EKF Prediction and Correction Cycle The time update is responsible for projecting forward, or predicting, in time the current state and uncertainty estimates to obtain a priori estimates at the next time step. Recall from the Markov assumption that only the previous state is required to predict the current state. 5.4.1 State Vector The state vector contains seven rotational terms and nine translational terms and is therefore length sixteen. The rotational terms include the quaternion, M C q, and angular velocity, C ωcam . The translational terms include the position, Mpcam , the velocity, Mvcam , and the 76 acceleration, M acam . x= h M Cq C ωcam M pcam M vcam M acam iT h iT = q0 qx qy qz ωx ωy ωz px py pz vx vy vz ax ay az (5.18) Note that the quaternion and position elements together form the transformation of the M M camera {C} with respect to the model {M}, H M C q, pcam = C H. In contrast, and maybe a subtlety, the computer algorithms and perspective projection model of chapter four rely on the inverse transformation, C MH. 5.4.2 State Prediction The state transition function is linear for all states except the quaternion. The same is true for all rotation parameterizations, but with quaternions singularities are avoided. k wq g(xk−1 , ∆t) qk|k−1 k wω ωk|k−1 ωk−1 2 pk|k−1 = pk−1 + ∆t vk−1 + ∆t ak−1 + wk p 2 k vk−1 + ∆t ak−1 wv vk|k−1 wak ak−1 ak|k−1 (5.19) This expression may be written more compactly as a single nonlinear function xk|k−1 = f (xk−1 , ∆t) + wk−1 . (5.20) Because the time step, ∆t, and angular velocity, ω, are both assumed to be small, the simple quaternion update function of equation 2.70 is used, i.e. g (xk−1 , ∆t) = qk−1 + ∆t QR (Ωk−1 ) qk−1 . 2 (5.21) This assumption makes the function less nonlinear by avoiding the trigonometric functions of equation 2.71 but might compromise accuracy if the assumption is violated. Recall that the EKF can perform very poorly in areas of high nonlinearity. 77 5.4.3 Uncertainty Prediction The Jacobian, Fk , of the state transition function may be quickly computed analytically. At each time step, only the elements associated with the quaternion will change, a 4x7 sub-block. i.e. Fk = ∂f (xk−1 , ∆t) ∂xk−1 ∂g(xk−1 ,∆t) ∂g(xk−1 ,∆t) = ∂qk−1 ∂ωk−1 0 0 0 0 I3 0 0 0 0 0 0 0 0 0 2 I3 ∆t ∆t2 . 0 I3 ∆t 0 0 I3 (5.22) It is assumed that the variance of the process disturbance model increases as sample time increases. In other words, inaccuracies of the process model itself are exaggerated as the amount of time increases. For this reason, a linear time-varying uncertainty matrix is used, σq2 s 0 Qk = ∆t 0 0 0 where the 5.5 σx2 s 0 2 σω s 0 0 0 0 0 σp2 s 0 0 0 0 0 σv2 s 0 0 0 0 , 0 (5.23) σa2 s terms represent units of variance per second. Measurement Update In the measurement update step the state conditional probability distribution is updated with sensor data. In our system, two of the measurement functions are linear and one is nonlinear. 
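Before turning to the individual measurement updates, the time update of section 5.4 can be made concrete. The sketch below follows equations 5.19 through 5.23 under the small-angle assumption; the state ordering matches equation 5.18, but the function names and the explicit normalization step are ours rather than the thesis implementation.

import numpy as np

def omega_matrix(w):
    """4x4 matrix such that q_dot = 0.5 * Omega(w) @ q for a scalar-first quaternion."""
    wx, wy, wz = w
    return np.array([[0.0, -wx, -wy, -wz],
                     [wx,  0.0,  wz, -wy],
                     [wy, -wz,  0.0,  wx],
                     [wz,  wy, -wx,  0.0]])

def predict_state(x, dt):
    """State prediction (eq. 5.19) for x = [q(4), w(3), p(3), v(3), a(3)]."""
    q, w, p, v, a = x[0:4], x[4:7], x[7:10], x[10:13], x[13:16]
    q_new = q + 0.5 * dt * omega_matrix(w) @ q      # small-angle quaternion update (eq. 5.21)
    q_new /= np.linalg.norm(q_new)                  # unit constraint; see also section 5.5.4
    p_new = p + dt * v + 0.5 * dt**2 * a            # constant-acceleration translation model
    v_new = v + dt * a
    return np.concatenate([q_new, w, p_new, v_new, a])

def process_noise(dt, var_q, var_w, var_p, var_v, var_a):
    """Time-varying process covariance (eq. 5.23): variance per second scaled by dt."""
    diag = [var_q] * 4 + [var_w] * 3 + [var_p] * 3 + [var_v] * 3 + [var_a] * 3
    return dt * np.diag(diag)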
The degree of nonlinearity is low and arises from vector rotation with quaternions. For any rotation parameterization a similar nonlinear function would be required. 78 5.5.1 Camera Update Of the camera, gyros, and accelerometers, the camera is the best measurement source for the pose estimation system. It provides absolute position and orientation data, " zcam # ±qk = + vk , pk (5.24) whereas the IMU only provides derivatives of these states. Fortunately, the camera’s measurement function is linear, " # qk|k−1 Hcam xk|k−1 = (5.25) pk|k−1 where " Hcam I4 0 0 0 0 = 0 0 I3 0 0 # , (5.26) and therefore the filter will produce an (nearly) optimal estimate of state and uncertainty. Measurements are received at frame rate assuming the fiducials are within the field of view, typically 25-30 Hz. Before using a camera measurement, it is necessary to perform a simple check for the “mirrored” quaternion, −(qk ). Recall that the uniqueness of a rotation quaternion is two, i.e. two opposite points or poles on the hypersphere. The Kalman Filter will effectively interpolate between the predicted and measured quaternion so it is of great importance to use the correct sign. Specifically, if q̂k|k−1 − qk > q̂k|k−1 + qk than the quaternion of the measurement should be negated. 5.5.2 Gyro Update The gyro noise level is relatively low and the sample time is much quicker than the camera, ∼ 100 Hz. Thus, the sensor helps smooth orientation between camera updates. Because angular velocity is one of the states, zgyro = ωk + vk , 79 (5.27) the gyro measurement function is also linear, Hgyro xk|k−1 = ωk|k−1 (5.28) h i Hgyro = 0 I3 0 0 0 . (5.29) where 5.5.3 Accelerometer Update In our system, the IMU returns angular velocity and acceleration at the same rate and within the same measurement vector. The measurements are presented separately for a couple of reasons. First, in general, a system might have only gyros or accelerometers; or, a system might have both sensors which operate asynchronously. Secondly, in our implementation the accelerometer is not always used (discussed below). An acceleration measurement is a combination of gravity and acceleration of the IMU itself, C zacl = Cacam + Cagrav + vk , (5.30) which has been rotated into the camera frame with the rotation matrix discussed in chapter three. A vector expressed in the camera or model frame can be rotated using the predicted orientation and quaternion multiplication via " and " # " # 0 0 −1 = qk|k−1 ⊗ C ⊗ qk|k−1 M zacl zacl (5.31) # " # 0 0 −1 = qk|k−1 ⊗ M ⊗ qk|k−1 . C zacl zacl (5.32) Alternatively, the same operation can be performed by converting the quaternion back into a rotation matrix and then multiplying the vector, such as C zacl = C M MR zacl −1 = R(qk|k−1 ) Mzacl . (5.33) We make the following assumption for human motion that would likely not apply to most other domains. When the magnitude of acceleration is approximately equal to 9.8 sm2 , 80 it is assumed that only gravity is measured and the IMU has no acceleration in the world frame. C m Ca , zacl ≤ (9.8 m2 + σa ) (9.8 − σ ) ≤ grav a 2 s s C (5.34) zacl = C Ca + a , otherwise cam grav We have found that our accelerometers contain a significant level of noise, which if used to update position, will be integrated twice. For this reason, the sensor is only used to update orientation at times when the magnitude is approximately equal to that of gravity. However, for completeness, the measurement update function for both position and orientation is presented later. 
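Because the camera and gyro measurement functions are linear, their updates reduce to selecting rows of the state; the only subtlety is the mirrored-quaternion check of section 5.5.1. The sketch below assumes the 16-element state ordering of equation 5.18; the helper names are ours, and the correction step is simply the generic one of equations 5.10 through 5.12.

import numpy as np

# Selection matrices for the linear measurement functions (eq. 5.26 and 5.29).
H_CAM = np.zeros((7, 16)); H_CAM[0:4, 0:4] = np.eye(4); H_CAM[4:7, 7:10] = np.eye(3)
H_GYRO = np.zeros((3, 16)); H_GYRO[0:3, 4:7] = np.eye(3)

def fix_mirrored_quaternion(z_cam, q_pred):
    """Negate the measured quaternion if -q is closer to the prediction (section 5.5.1)."""
    q_meas = z_cam[0:4]
    if np.linalg.norm(q_pred - q_meas) > np.linalg.norm(q_pred + q_meas):
        z_cam = z_cam.copy()
        z_cam[0:4] = -q_meas
    return z_cam

def linear_update(x_pred, P_pred, z, H, R):
    """Standard Kalman correction for a linear measurement z = H x + v (eq. 5.10-5.12)."""
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x = x_pred + K @ (z - H @ x_pred)
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x, P

# Usage: z_cam is [q0 qx qy qz px py pz] from the vision thread, R_cam from section 5.6.
# z_cam = fix_mirrored_quaternion(z_cam, x_pred[0:4])
# x, P = linear_update(x_pred, P_pred, z_cam, H_CAM, R_cam)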
Furthermore, future enhancement of the AR system might involve replacing the IMU with a less noisy sensor. 5.5.3.1 Orientation Both measurement updates rely on the fact that we precisely know gravity in the model. Gravity vectors in the model and camera frames are related by a simple rotation, C agrav = C M MR agrav . (5.35) To correct orientation only, gravity is rotated into the camera frame with the predicted quaternion −1 hacl xk|k−1 , Magrav = R(qk|k−1 ) Magrav . (5.36) Nonlinearity is introduced by the parameterization of the rotation matrix with the quaternion. The Jacobian may be quickly computed analytically as k Hacl = 5.5.3.2 h ∂hacl (xk|k−1 , Magrav ) ∂qk|k−1 i 0 0 0 0 . (5.37) Position and Orientation Composite acceleration vectors in the model and camera frames are also related by a simple rotation, C M acam + Cagrav = C acam + Magrav . (5.38) MR 81 To correct both position and orientation, acceleration is rotated into the camera frame with the predicted quaternion, −1 hacl xk|k−1 , Magrav = R(qk|k−1 ) ak|k−1 + M agrav . (5.39) The Jacobian is almost the same as above but includes an additional term for acceleration within the state vector, k Hacl = 5.5.4 h ∂hacl (xk|k−1 , Magrav ) ∂qk|k−1 0 0 0 ∂hacl (xk|k−1 , Magrav ) ∂ak|k−1 i . (5.40) Quaternion Constraint The Kalman Filter, EKF, and UKF do not directly impose any constraints on vector length. Consequently, the quaternion might not be unit length after a measurement update. One option is to simply perform a brute force normalization after each measurement update. q̂k|k = qk|k kqk|k k (5.41) Unfortunately, this does not fix the problem entirely because the uncertainty is also constrained in a similar way as the state. Several papers propose techniques by modifying the Kalman Filter and adding a correction step, [39], [7], [43]. In our case, no ill-effects are observed because the sample time and uncertainty are both assumed to be small. 82 5.6 Camera Uncertainty All measurement sources into the EKF contain an estimated uncertainty in the form of a covariance matrix. The gyro and accelerometer covariance matrices remain constant throughout time whereas the camera uncertainty, Σkcam , is different at each time step. Noisy image points cause uncertainty in the final solution of the computer vision algorithm(s). How this noise propagates through the optimization depends on the pose of the model with respect to the camera and the configuration of model points. Some poses and configurations are more sensitive to noise and should therefore be assigned a higher degree of uncertainty. In this section, methods are developed for calculating pose uncertainty with quaternions in object space. An unconstrained image space pose uncertainty calculation can be found in Vincent and Hoff [41]. 5.6.1 Derivation Recall from chapter four that a computer vision algorithm will minimize a residual in either image space or object space. This derivation relies on prior knowledge of which error space the minimization occurred and a model of image point noise, assumed to be Gaussian with zero mean and known variance. For reasons already stated previously, the elements of the rotation matrix are often parameterized, and that the entire 6-DOF pose can be parameterized by a minimum of six parameters. 
The image space, 2 2 r2 (Θ) MPmi + ty (Θ) r1 (Θ) MPmi + tx (Θ) + u0 + nu − ui + fv + v0 + nv − vi , ei (Θ, n) = fu r3 (Θ) MPmi + tz (Θ) r3 (Θ) MPmi + tz (Θ) (5.42) and object space, 2 ei (Θ, n) = (I − Fi (n)) R(Θ) MPmi + t(Θ) , (5.43) error functions are rewritten to include a noise term and pose parameterization. Notice how noise propagates though the functions differently. In our particular case, quaternions are used within the EKF so the pose parameterization vector, h iT Θ = q0 qx qy qz tx ty tz , 83 (5.44) is length seven, where H(q, t)−1 = function and single variable, f (x) = ∞ X f (n) (a) n=0 n! C MH. Recall the form of the Taylor Series for an arbitrary 0 (x − a)n = f (a) + 00 f (a)(x − a) f (a)(x − a)2 + + ... . 1! 2! (5.45) The error functions are approximated by a first-order Taylor Series in two variables, pose and noise, via ∂ei (Θ̂, 0) ∂ei (Θ̂, 0) (Θ − Θ̂) + (n − 0) ∂Θ ∂Θ , ≈ ei (Θ̂, 0) + Ai (Θ − Θ̂) + Bi n ei (Θ, n) ≈ ei (Θ̂, 0) + (5.46) where Θ̂ is the estimated pose of the computer vision algorithm solution and Θ is the true pose which, of course, is not known in practice. It is assumed that the computer vision algorithm converged properly to the global minima of the error function. Therefore, the first order Taylor series is an approximation of the function evaluated at the minima, min Θ n X i=1 kei (Θ, n)k2 ≈ min Θ n 2 X ei (Θ̂, 0) + Ai (Θ − Θ̂) + Bi n . (5.47) i=1 Let Θ̃ equal the approximate error between the true and estimated pose, Θ − Θ̂. It can be calculated, in a least squares sense, using the projection theorem, Θ̃ = (AT A)−1 AT (E(Θ̂, 0) + B n) (5.48) = C(E(Θ̂, 0) + B n) = C E(Θ̂, 0) + C B n . The large matrices A and B are constructed from the entire set. The A matrix is vertically stacked Specifically, A1 B1 . . A= . , B=0 An 0 individual Jacobians of each point for while the B matrix is block diagonal. 0 .. . 0 0 0 . Bn (5.49) Because the image noise is assumed to have zero mean, n̄ = 0, the average error between the 84 true and estimate pose is equal to ¯ = C E(Θ̂, 0) . Θ̃ (5.50) Therefore, the standard deviation of the error between the true and estimate pose is approximately h i ¯ = C E(Θ̂, 0) + C B n − C E(Θ̂, 0) = C B n . (Θ̃ − Θ̃) (5.51) A least-squares estimate of the pose error covariance can now be derived. Its accuracy is only as good as the first order linearization. T ¯ ¯ COV(Θ̃) = E Θ̃ − Θ̃ Θ̃ − Θ̃ h i = E (C B n) (C B n)T = E C B n nT B T C T = C B E n nT B T C T (5.52) = C B COV(n) B T C T 5.6.2 Image Space The pose uncertainty calculation assuming an image space minimization is presented first because more computer vision algorithms operate in image space and the calculation itself is easier. As stated earlier, the image point noise is assumed to be Gaussian with zero mean and known variance, " # " # nu N (0, σu ) ∼ . (5.53) nv N (0, σv ) If the noise components of a single image point are uncorrelated, the covariance may be written as " # σu2 0 cov(n) = . (5.54) 0 σv2 85 Furthermore, a large covariance matrix for the entire point set may be constructed from smaller covariance matrices along the diagonal, cov(n) 0 0 .. COV(n) = . 0 0 0 0 cov(n) (5.55) The parameterized image space error function has two error components, r1 (Θ) MPmi + tx (Θ) + u0 + nu − ui r3 (Θ) MPmi + tz (Θ) (5.56) r2 (Θ) MPmi + ty (Θ) + v0 + nv − vi . 
r3 (Θ) MPmi + tz (Θ) (5.57) eiu (Θ, n) = fu eiv (Θ, n) = fv Because noise enters linearly into the error function, the Jacobians with respect to noise, Bi , equal identity and the pose error covariance equation 5.52 can be simplified to COV(Θ̃) = C B COV(n) B T C T = C I COV(n) I T C T (5.58) = C COV(n) C T . The Jacobian for a single image point with quaternion pose parameterization has the form of # " i i Ai = 5.6.3 ∂eu (Θ̂,0) ∂q ∂eiv (Θ̂,0) ∂q ∂eu (Θ̂,0) ∂t ∂eiv (Θ̂,0) ∂t . (5.59) Object Space The pose uncertainty calculation in object space is more complicated than image space, but it is still desirable to have for algorithms such as the Orthogonal Iteration Algorithm of chapter four. Because noise enters nonlinearly into the error function through the projection matrix, (wi + n)(wi + n)T Fi (n) = , (5.60) (wi + n)T (wi + n) the simplified pose error covariance calculation of image space does not apply. Recall that the projection matrix requires normalized image points, but defining a noise distribution within a 86 full-size image is more intuitive. In other words, let the full-image noise distribution undergo a transformation, # " n= N (0,σu ) fu N (0,σv ) fv . (5.61) Fortunately, the new distribution is still Gaussian because the operation is linear, i.e. division of focal length. Now the modified covariance matrix for an individual image point is written as # " ( σfuu )2 0 , (5.62) cov(n) = 0 ( σfvv )2 and the large covariance matrix of the entire point set is constructed like before. The parameterized object space error function has three components, eix (Θ, n) = I1 − f1i (n) R(Θ) MPmi + t(Θ) (5.63) eiy (Θ, n) = I2 − f2i (n) R(Θ) MPmi + t(Θ) eiz (Θ, n) = I3 − f3i (n) R(Θ) MPmi + t(Θ) . (5.64) (5.65) The Jacobians (pose and noise) for a single image point with quaternion pose parameterization have for form of i i i i Ai = 5.6.4 ∂ex (Θ̂,0) ∂ei ∂q y (Θ̂,0) ∂q ∂eiz (Θ̂,0) ∂q ∂ex (Θ̂,0) ∂t ∂eiy (Θ̂,0) ∂t ∂eiz (Θ̂,0) ∂t , Bi = ∂ex (Θ̂,0) u ∂ei∂n(Θ̂,0) y ∂nu ∂eiz (Θ̂,0) ∂nu ∂ex (Θ̂,0) ∂nv ∂eiy (Θ̂,0) ∂nv ∂eiz (Θ̂,0) ∂nv . (5.66) Quaternion Constraint The unit length constraint is a disadvantage of quaternions as a parameterization within the pose uncertainty calculation. The constraint that the true quaternion is unit length, kqk = 1, and that the error lies along the hypersphere must somehow be enforced. Recall the minimization from the pose uncertainty derivation, 2 2 min e(Θ̂, 0) + A Θ̃ + B n = min y − A Θ̃ . Θ̃ (5.67) Θ̃ Forcing the error to be along the hypersphere cannot be achieved through a linear operation. Instead, the constraint is relaxed so that the error lies in a hyperplane which is tangent at 87 Figure 5.3: Hyperplane Tangent to Hypersphere the point of the estimated quaternion. See figure 5.3. This is equivalent to saying the error vector must be orthogonal to the estimated quaternion, and therefore, the dot product should equal zero, q̂ · (q − q̂) = 0. The minimization is rewritten to include the dot product constraint, 2 min y − A Θ̃ such that D Θ̃ = 0 , (5.68) Θ̃ where h i D = qˆ0 qˆx qˆy qˆz 0 0 0 . (5.69) This constraint forces the error vector into the null space of D. The translation elements are free dimensions because they are not part of the dot product or orthogonality constraint that defines the hyperplane in 4-dimensional space. N = Null(D) = {x ∈ Rn : D x = 0} (5.70) DN x = 0 ∀x (5.71) The Householder Transformation is a very quick way to calculate the null space of a vector. The minimization is written again to include the linear null space constraint. 
min ky − A N xk2 (5.72) x̂ = ((A N )T (A N ))−1 (A N )T y (5.73) x Finally, the pose covariance equation is modified slightly to include the quaternion constraint, i.e. C = N ((A N )T (A N ))−1 (A N )T . (5.74) 88 Chapter 6 Real-time System The goal of this chapter is to show how all of the concepts covered previously are connected into a functional real-time augmented reality system. Brief hardware and software descriptions are provided first. Then, important software subsystems are covered in more detail. 6.1 Hardware Description The hardware design of the system is very simple because off-the-shelf components could be used. In other words, no custom hardware was needed. See figure 6.1 for a complete list. The Prosilica GC1020C and Crista IMU are the only sensor inputs to the system. A gigabit ethernet camera was chosen because of its ability to reliably stream massive amounts of video data over long cable distance. Powerful on-board processing converts raw Bayer images to 8-bit RGB and then transmits them as very large 9K packets, ”Jumbo frames”. A camera of this type places minimal burden on the workstation allowing resources to be dedicated to other processes. A wide-angle lens, ∼ 70◦ , is used so that cones remain in the field of view as often as possible. The Crista IMU was chosen because of its low cost, standard RS232 interface, and prior experience with the product. A middle-end workstation with dual-core Xeon processors and 2GB of RAM is only slightly modified by replacement of the graphics and ethernet cards. Figure 6.2 shows the sensor system with LCD attached by a custom mounting bracket that was manufactured within a of couple hours. The sensors and display are all rigidly 89 Prosilica GC1020C Crista IMU Dell Precision 690 Nvidia 8800GTS Xenarc 840YV Kowa LM4NCL Intel PWLA8391GT 4” Traffic Cones (x3) Gigabit Ethernet Camera, 1024x768, 30fps, 1/3” CCD MEMS Accelerometer and Gyroscope, 3Axis, 100Hz, Serial Interface Dual Core Xeon 5130 2GHz, 4MB L2 Cache 1333MHz, 2GB DDR2 ECC SDRAM Memory 667MHz Middle-end Graphics Card / GPU 8.4” LCD, 1024x768 Wide-Angle C-mount Lens, 3.5mm F1.4 Gigabit Ethernet Card, Jumbo Frames Supported, 9K Packet Size Fiducials, Larger Size Could Easily Be Used $1200 $1300 $1300 $200 $200 $100 $25 $10 Figure 6.1: Hardware Components & Sensors attached. To reduce the induced moment of the IMU about the camera, the sensors are placed as closely together as possible, less than 6cm. The LCD is aligned with the camera so that the relationship between the two coordinate frames is merely a small translation in the negative z direction, along the camera’s optical axis. To support the choice of a video see-through LCD, a short discussion of augmented reality display options is provided next. As stated in chapter one, basically two different display technologies are available for use in AR, optical and video see-through. Optical see-through allows the user to see the real world directly. Graphics are projected into the user’s field of view with an optical combiner, such as half-silvered mirrors. Video see-through does not let the user see the real world directly. Instead, a video stream is blended with graphics and displayed to the user. Typically, head-mounted displays (HMD) are used but a handheld LCD is another option. Any display will have advantages and disadvantages but the choice of technology typically depends on the application requirements. 
Optical blending is less computationally expensive than video blending because only virtual objects must be rendered. Also, the video stream may be considerably warped by lens distortion, and the focal length will likely not be equivalent to that of the human eye. Video see-though is also limited by the physical resolution of the camera and artifacts may be introduced by compression. Optical combiners reduce the amount of light from the real world that reaches the 90 Figure 6.2: Sensors & LCD eye, somewhere around 30%. As a result, virtual objects do not completely obscure realworld objects because light is combined from both real and virtual sources. In high ambient light environments, virtual graphics will appear opaque or may even vanish completely. Optical see-though offers an instantaneous view of the real-world, but the virtual objects will inherently be delayed by tracking and rendering delays. Video see-though has the advantage of matching real and virtual graphics in software when delays are present. In addition to estimating pose of the display itself, optical see-though requires an extra calibration step and eye tracking for accurate registration. The calibration, known as occultation, models the distance of the eyes relative to each other and the display. Similarly, in video see-through, a large separation between camera and display can create perceptual discomfort, but the mind is typically able to seamlessly compensate for small disparity. Finally, video see-though has complete control over what the user sees, at the pixel level, and may ultimately produce more compelling environments. If a camera is already present for tracking purposes, it makes sense to use it within a video see-through display. 91 Pose Estimation & AR (CSM-PEARL) Open Computer Vision (OpenCV) wxWidgets (?) Custom Testbed Library Image Processing & Computer Vision Library Performance 3D Graphics Library Cross-Platform GUI Library Prosilica Gigabit Ethernet (PvAPI) Cloudcap Serial-IMU (CSI?) Ethernet Camera Interface Serial IMU Interface Integrated Performance Primitives (IPP) Optimized Functions for Intel Chipsets Open Scene Graph (OSG) Awesome! Open Source Open Source Open Source Free Open Source Free* Figure 6.3: Libraries / Dependencies 6.2 Software Description Unlike the hardware design, the software architecture design is much more complicated. The seven libraries shown in figure 6.3 were needed. Fortunately, all of the libraries are either open source or free for educational use. A large code base (CSM-PEARL) was written specifically for the testbed and is provided on-line for download (control.mines.edu). Figure 6.4 shows an overall view of the software architecture. Data from the two sensors is processed and placed into a circular buffer. The EKF pulls measurements from the buffer and produces a filtered pose estimate. Graphics are generated by blending the most recent pose estimate and camera image. A GUI allows control of sensors, EKF filter parameters, and various rendering options. A real-time system like this will inevitably require parallel processing. Figure 6.5 lists the threads that were required. 6.3 Measurement Buffer The measurement buffer, figure 6.6, is an essential subsystem which does much more than simply buffer measurements. It coordinates activity of the sensor threads and EKF thread. The workstation is powerful enough so that the buffer never becomes completely full. Measurements and time-stamp from both sensor sources are placed into a set of nodes, figure 6.7. 
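As elaborated in the following paragraphs, each node carries a time-stamp, a measurement type, and a pending flag so that the asynchronous sensor threads can write out of order while the EKF thread consumes in order. The sketch below is a simplified single-process illustration in Python; the actual testbed is written in C++, and the field names, buffer size, and the omission of thread locking are our simplifications.

from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class MeasurementNode:
    timestamp: float = 0.0
    kind: Optional[str] = None      # "camera", "imu", or None if no valid measurement resulted
    pending: bool = False           # True while the producing thread is still processing
    data: Optional[np.ndarray] = None
    covariance: Optional[np.ndarray] = None

class MeasurementBuffer:
    """Fixed-size circular buffer shared by the sensor threads and the EKF thread."""
    def __init__(self, size=64):
        self.nodes = [MeasurementNode() for _ in range(size)]
        self.write_idx = 0
        self.read_idx = 0

    def begin_write(self, timestamp):
        """Reserve the next node: time-stamp it and mark it pending before processing starts."""
        node = self.nodes[self.write_idx]
        node.timestamp, node.pending, node.kind = timestamp, True, None
        self.write_idx = (self.write_idx + 1) % len(self.nodes)
        return node

    def finish_write(self, node, kind, data=None, covariance=None):
        """Complete the node; kind stays None when processing produced no valid measurement."""
        node.kind, node.data, node.covariance = kind, data, covariance
        node.pending = False

    def next_for_filter(self):
        """Return the next completed node for the EKF, or None while it is still pending."""
        node = self.nodes[self.read_idx]
        if self.read_idx == self.write_idx or node.pending:
            return None
        self.read_idx = (self.read_idx + 1) % len(self.nodes)
        return node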
The circular buffer is very similar in structure to a standard linked list with a few extra 92 Figure 6.4: Full System Camera Vision IMU EKF Graphics GUI Color Classification & Buffering Segmentation, Pose Estimation, & Buffering Data Processing & Buffering Sensor Fusion & Recursive Estimation Virtual Model Update & Data Logging User Control of Sensors & Estimation Figure 6.5: Threads 93 Figure 6.6: Measurement Buffer 94 Figure 6.7: Measurement Node features. As soon as new sensor data begins to be processed, the current write node it set to pending status and a time-stamp field is recorded. When processing is completed, the final measurement is written and pending status is dropped. The pending concept is extremely important here because the sensors operate asynchronously and the camera processing time is much longer than the IMU. Consequently, IMU measurements are allowed to “pile up” behind a camera measurement in the buffer. Typically, the EKF waits a short time for a camera measurement to be completed (no longer pending) and then all of the IMU measurements behind it are quickly processed afterwards. The measurement type tells the EKF which data structures within a node to use, either camera or IMU. Additionally, the type may be set to null so that if a pending measurement fails to yield a valid measurement, such no fiducials in the field of view, it is still completed. When the EKF sees such a node it is simply ignored and quickly moves on to read the next node. 95 Figure 6.8: Vision Thread 96 6.4 Vision Thread The vision thread is the most demanding of all threads, figure 6.8. It receives a binary image from the camera buffer and produces a pose estimate with associated uncertainty. Cones are segmented according to the procedure outlined in chapter two. Given a predicted pose from the EKF, the model point locations are corrected, and then correspondence is found between the 2D-3D set. The non-iterative 3-Simplex and Orthogonal Iteration algorithms of chapter four generate a pose estimate. The 3-simplex algorithm is not used when a valid initial guess from the EKF is available. The object space uncertainty algorithm of chapter five generates a new covariance matrix at each iteration. Before accepting a pose estimate to be valid, two constraints are imposed. The goal of the constraint is to reject vision solutions that do not converge properly to the global minima. This could happen for three reasons. First, the segmented image points may be extremely noisy or a cone could have been falsely detected. Second, the 2D-3D correspondence may not be correct. Last, the algorithm may have been provided a bad initial guess far outside the neighborhood of the global minima. The camera uncertainty calculation assumes that an error function has been minimized, and it is therefore important to detect outliers that violate the assumption. If an outlier is not caught, the uncertainty will be severely underestimated and the EKF will trust the measurement much more than is should. 6.4.1 Image Space Error Residual Constraint Recall the form of the image space error function 4.1 as a sum of squared 2-dimensional error on the image plane. Assume the individual errors of each image point are iid Gaussian random variables, eiu ∼ N(0, σu ) and eiv ∼ N(0, σv ). Therefore, equation 4.1 can be modified slightly into the form of the unit variance Chi-Square distribution. 
χ2k = k X (x̂ − xi )2 i=1 x= n X (ûi − ui )2 i=1 σu2 (6.1) σx2 n 2 2 (v̂i − vi )2 X (eiu ) (eiv ) + = + , k = 2n σv2 σv2 σv2 i=1 (6.2) A threshold probability and the Chi-square CDF define acceptance and rejection re97 gions on the residual within a binary hypothesis test. In practice, a lookup table of the CDF is used because the gamma functions, γ(k/2, x/2) and Γ(k/2), are difficult to calculate. See the appendix for more information. Fχ (x, k) = γ(k/2, x/2) Γ(k/2) Accept, F (x, k) ≤ threshold χ Constraint = Reject, otherwise 6.4.2 (6.3) (6.4) Relative Change in Motion Constraint Human motion is restricted to small changes in position and orientation over a short time interval, in our case ∆t < 10ms. Assuming the predicted pose of the EKF is nearly correct, a relative change in motion from the new vision solution may be quickly computed. The model frame remains fixed in time while the camera frame is likely moving. C1 MH C1 C2H = = C MHk|k−1 C1 MH , C2 MH −1 C2 MH = C MHk|k o n → k̂, θ, t (6.5) (6.6) The rotation matrix is decomposed into the angle-axis representation; the axis is ignored. Therefore, relative motion may be described by a single angle and distance. Thresholds are placed on these values to define acceptance and rejection regions on the motion. Accept, |θ| ≤ θ thresh ∩ ktk ≤ tthresh Constraint = Reject, otherwise 6.5 (6.7) Graphics Generation Graphics generation is relatively simple and occurs in two stages, figure 6.9. A background image from the camera is rendered first, and then the 3D model and pose information are rendered on top. A few different rendering strategies were tested before deciding on the one that worked best within our system. First, rendering was performed after all EKF 98 Figure 6.9: Graphics Generation Figure 6.10: C-130 Hercules measurement updates. Then, rendering was reduced to occur only after an IMU update. Finally, a constant rendering rate was adopted which grabs the latest EKF estimate every 25ms. The last option looks the most visually appealing while simultaneously reducing the burden placed on the workstation to a frame rate just above what is humanly detectable. No effort was needed to match camera images and EKF estimates because a delay of one or two frames is not detectable. Three virtual 3D models were tested within the system. In theory, any of the many 3D file formats supported by Open Scene Graph could be used. However, all of our models are natively VRML 2.0 format (.wrl) and then converted to the Wavefront format (.obj). The C130 Hercules of figure 6.10 was used within the military training application discussed in chapter one. All six joints of the Kuka manipulator shown in figure 6.11 can be manually adjusted 99 Figure 6.11: Kuka KR6 with a slide bar to control the pose of the gripper. The idea here is that manipulator kinematics can be taught to students by interactively controlling the arm and seeing it move immediately in 3-dimensional space. Additionally, this capability might be helpful for teleoperation of a mobile manipulator within a very sensitive work area where any action must be carefully planned and verified. Finally, the Matlab peaks plot of figure 6.12 demonstrates the capability to visualize complex 3D data sets. VRML files can be generated from any Matlab plot. It is much more meaningful and easier to comprehend such data when viewed in 3D instead of merely a flat image on a computer screen. 
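Returning to the two outlier gates of section 6.4, both checks are short to state in code. The sketch below substitutes scipy's chi-square CDF for the lookup table described above and recovers the relative rotation angle from the trace of the rotation matrix; the function names and threshold values are illustrative, not those used in the testbed.

import numpy as np
from scipy.stats import chi2

def residual_gate(e_u, e_v, sigma_u, sigma_v, p_thresh=0.95):
    """Chi-square test on the image space residual (section 6.4.1).
    e_u, e_v are arrays of per-point reprojection errors in pixels."""
    x = np.sum((e_u / sigma_u) ** 2) + np.sum((e_v / sigma_v) ** 2)
    k = 2 * len(e_u)                      # two residual components per image point
    return chi2.cdf(x, k) <= p_thresh     # accept only if the residual is statistically plausible

def motion_gate(H_pred, H_meas, ang_thresh_rad, trans_thresh):
    """Relative change-in-motion test (section 6.4.2).
    H_pred, H_meas are 4x4 camera-from-model transforms at the predicted and measured pose."""
    H_rel = H_pred @ np.linalg.inv(H_meas)
    R_rel, t_rel = H_rel[:3, :3], H_rel[:3, 3]
    # Rotation angle recovered from the trace of the relative rotation matrix.
    angle = np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))
    return abs(angle) <= ang_thresh_rad and np.linalg.norm(t_rel) <= trans_thresh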
100 Figure 6.12: Peaks Plot 101 Chapter 7 Experiments At this point in the thesis, all background material has been covered that is necessary to understand how the augmented reality system works, and it is now time to show some experimental results. First, various computer vision algorithms (chapter four) are compared through simulation. Next, a Monte Carlo method is used to verify the pose uncertainty calculations (chapter five). Then, a new method for testing overlay accuracy of virtual graphics is developed. Finally, results of a “realism” questionnaire completed by CSM students are presented. 7.1 Computer Vision Algorithms Simulations are used to compare the performance of various computer vision algorithms of chapter four. Performance metrics include position and orientation error, image and object space error, number of outliers, and run time. Each data set contains 500 independent simulations. To make the simulations as realistic as possible, data is generated to match what we expect to encounter in our system, but the results may be generalized to a certainty extent. The effective focal lengths of our camera, fu = 770, fv = 776, are used to project 3D points onto the image plane, 1024x768, and then Gaussian noise is added to the image points. At each iteration, a new pose and point cloud are generated so that the points lie approximately 5’ in front of the camera with 1’ spread. The algorithms are tested with point clouds containing either 10 or 25 points. 102 Results of the computer vision simulations are shown in the form of a notched boxand-whisker plot. The box plot is a useful way to display differences in data sets without making any assumption of the underlying statistical distribution. The spacing between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers. A box has lines drawn at the lower quartile (Q1), median (Q2), and upper quartile (Q3). Whiskers show the extent of the rest of the data. They extend to the most extreme data points that are within 1.5 times the interquartile range, Q3-Q1. Data points beyond the ends of the whiskers are considered outliers. Instead of plotting the location of outliers, the total number is placed at the end of the whiskers (non-standard). In a notched box plot the notches represent a robust estimate of the uncertainty about the medians for box-to-box comparison. Boxes whose notches do not overlap indicate that the medians of the two groups differ at the 5% significance level. In other words, you can conclude, with 95% confidence, that the true medians do differ. 7.1.1 Overall Algorithm Comparison Group 1 1. DLT 2. MDLT (IS) 3. MDLT (OS) 4. 3-Simplex (IS) 5. 3-Simplex (OS) Group 2 1. EPnP (v1) 2. EPnP (v2) 3. 3-Simplex SDP (IS) 4. 3-Simplex SDP (OS) Group 3 1. MDLT + OIA (IS) 2. 3-Simplex + OIA (IS) 3. MDLT + Gauss-Newton (IS) 4. 3-Simplex + Gauss-Newton (IS) Group 4 1. MDLT + OIA (OS) 2. 3-Simplex + OIA (OS) 3. MDLT + Gauss-Newton (OS) 4. 3-Simplex + Gauss-Newton (OS) Figure 7.1: Computer Vision Algorithm Test Groups As shown in figure 7.1, the computer vision algorithms are separated into four groups. The grouping makes it easier to refer to algorithms during the discussion. The abbreviations 103 IS and OS mean image and object space, respectively. Group 1 contains five non-iterative algorithms. Group 2 contains three “eigenvector algorithms”, including two from [27] and a new SDP. 
Groups 3 & 4 contain algorithm combinations which use either the Orthogonal Iteration Algorithm (OIA) or Gauss-Newton to refine a pose estimate iteratively. The algorithms of groups 3 & 4 operate in either image or object space, respectively. All of the groups were tested separately with either 10 or 25 points and 5 or 10 pixel error. Box plots of orientation error, position error, and runtime are located in the appendix. See figures 9.1 through 9.8. When inspecting the results from all of the groups, one of the first observations to make is that pose error increases with image noise, but with more points the algorithm is less sensitive to noise. The second observation is that reducing an image or object space error yields what seem to be equivalent results, when point depth has small variation. When inspecting group 1 results, the DLT obviously performs much worse than the other four algorithms. It appears as though the MDLT and 3-Simplex work equally as well within a linear algorithm. However, runtime of the simplex algorithms is slightly longer than the MDLT. Similarly, the runtime of the object space algorithms is slightly longer than the image space algorithms. When comparing group 2 against group 1, the results are better. The linear eigenvector method (EPnP v1) shows some improvement. A slightly higher gain is seen with the Gauss-Newton method (EPnP v2), and the most improvement is seen with the new 3-Simplex SDP. However, the runtime of the simplex algorithm is quite terrible. Converting the problem into a sum-of-square introduces complexity but allows the use of well-known proofs of convexity and global convergence. In some situations, finding a good initial guess for an iterative algorithm might be more important than runtime. When inspecting groups 3 & 4 the results of both groups look very similar. Recall that the only difference in the groups was the error space where the minimization occurred. It is more interesting to notice that the OIA and Gauss-Newton algorithms produce almost identical results at the two noise levels tested. Additionally, the OIA is much slower than the Gauss-Newton. When the number of points is increased, the added runtime to the OIA is noticeable as opposed to the Gauss-Newton which stays about the same. This is because of the data reduction within the “reduced measurement matrices”. 104 7.1.2 Best Algorithm Combinations Group 5 1. 3-Simplex + OIA (IS) 2. 3-Simplex + OIA (OS) 3. 3-Simplex + Gauss-Newton (IS) 4. 3-Simplex + Gauss-Newton (OS) Figure 7.2: “Best Algorithm Combinations” In our pose estimation system, an initial guess is always taken from the EKF, except at times when tracking has been lost and initialization during startup. In the latter case, an initial guess must be taken from an image. For this reason, it is necessary to have an algorithm combination that can generate an initial guess and then (hopefully) iterate toward the global minima. Group 5, figure 7.2, consists of the “best algorithm combinations” based on the results above. A question posed earlier was whether or not it matters from which error space an initial guess is given to the OIA. Testing of group 5 addresses this question by the use of the linear simplex method in both image and object error spaces. The 3-Simplex Gauss-Newton is compared against the OIA. Group 5 is tested at much higher noise levels than in the section above. See figure 9.9 in the appendix. 
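For reference, the kind of simulated trial used throughout these comparisons can be reproduced with a few lines. The sketch below follows the setup of section 7.1 (points roughly 5 feet in front of the camera with about 1 foot of spread, fu = 770, fv = 776, additive Gaussian pixel noise); placing the principal point at the image center and the solver interface are our assumptions, since each group wraps a different algorithm.

import numpy as np

FU, FV, U0, V0 = 770.0, 776.0, 512.0, 384.0   # principal point assumed at the 1024x768 center

def random_trial(n_points=10, sigma_px=5.0, rng=np.random.default_rng()):
    """Generate one synthetic trial: 3D points in the camera frame and noisy image points."""
    # Points spread about 1 foot around a nominal location 5 feet in front of the camera.
    pts_cam = np.array([0.0, 0.0, 5.0]) + rng.uniform(-0.5, 0.5, size=(n_points, 3))
    # Perspective projection followed by additive Gaussian pixel noise.
    u = FU * pts_cam[:, 0] / pts_cam[:, 2] + U0 + rng.normal(0.0, sigma_px, n_points)
    v = FV * pts_cam[:, 1] / pts_cam[:, 2] + V0 + rng.normal(0.0, sigma_px, n_points)
    return pts_cam, np.column_stack([u, v])

def run_trials(solver, n_trials=500, **kwargs):
    """Run independent trials and collect whatever error metric the supplied solver reports."""
    errors = []
    for _ in range(n_trials):
        pts, img = random_trial(**kwargs)
        errors.append(solver(pts, img))   # e.g. rotation or translation error for one trial
    return np.array(errors)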
The results indicate that pose error and runtime of the OIA are not significantly affected by initial guesses from different error spaces. It was already shown that at low noise levels the two iterative algorithms produce equivalent results, but the GaussNewton is much faster. However, as shown in the results by a greater number of outliers, the Gauss-Newton algorithm tends to converge less successfully than the OIA when noise is increased. At σuv = 25 the results still look identical. At σuv = 50 convergence of the Gauss-Newton has begun to worsen, and then at σuv = 75 the Gauss-Newton is clearly much worse than the OIA. It is important to note that these noise level are much higher than we would ever expect to encounter, but it illustrates the point that the OIA has very good global convergence properties. In other words, in our system the OIA proof of convergence was more important than the asymptotic efficiency of the Gauss-Newton. 105 7.1.3 Camera Uncertainty To verify the pose uncertainty calculations of chapter five, a Monte Carlo method is used. Recall that the uncertainty is equivalent to the variance of the error between the true and estimated pose. If the uncertainty calculation performs well, it should predict a covariance equal to the covariance obtained through repeated sampling in the Monte Carlo simulations. In the computer algorithm simulations of the last section, completely new data is generated at each iteration. Here, the same pose and 3D point set are retained for the duration of the simulation and only random image point noise is generated at each trial. The number of trials needed to approximate the true covariance was found by inspecting the variance of the data and seeing that it converged around 1000 trials. Nonlinear least-squares computer vision algorithms with Euler angle parameterization minimize a residual in image or object space. The algorithms are provided the true pose as an initial guess so that they always converge properly to the global minima. Recall from chapter five that the uncertainty calculation assumes an error function has been minimized to the global minima; hence, it converged properly. All simulations use a set of 25 model points. The results are shown in figures 7.3 through 7.6. Each table contains actual and estimation covariances for three different trials. The first, second, and third trials have 5, 7.5, and 10 pixel noise, respectively. All four figures indicate that the uncertainty calculations do in fact accurately estimate covariance. 
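The Monte Carlo comparison itself follows a simple pattern: hold the pose and model points fixed, resample the image noise, re-solve, and compare the sample covariance of the pose error against the analytic estimate of chapter five. A minimal sketch follows; the pose solver and the analytic covariance routine are assumed to be supplied as functions with the interfaces shown.

import numpy as np

def verify_covariance(solve_pose, predict_cov, pts_model, img_true, pose_true,
                      sigma_px=5.0, n_trials=1000, rng=np.random.default_rng()):
    """Compare a Monte Carlo pose-error covariance with the analytic prediction.

    solve_pose(pts_model, img_noisy, init) -> pose vector (e.g. [q, t], length 7)
    predict_cov(pts_model, img_true, pose_est, sigma_px) -> predicted covariance matrix
    """
    errors = []
    for _ in range(n_trials):
        img_noisy = img_true + rng.normal(0.0, sigma_px, img_true.shape)
        # The true pose is given as the initial guess so the solver converges to the global
        # minimum, as assumed by the uncertainty derivation.
        pose_est = solve_pose(pts_model, img_noisy, init=pose_true)
        errors.append(pose_est - pose_true)
    cov_mc = np.cov(np.array(errors).T)                               # sampled covariance
    cov_analytic = predict_cov(pts_model, img_true, pose_true, sigma_px)
    return cov_mc, cov_analytic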
Trial, σ Estimated Actual α 0.0469 0.0487 β 0.0212 0.0218 γ 0.0439 0.0443 tx 5.7136 5.7620 ty 5.2124 5.2909 tz 6.6508 6.9735 Estimated Actual 0.0220 0.0222 0.0179 0.0168 0.0168 0.0158 4.7803 5.0047 6.3891 6.1433 6.2329 6.2856 Estimated Actual 0.0396 0.0303 0.0223 0.0231 0.0378 0.0261 7.7118 6.9733 8.6137 6.0114 9.8118 7.4261 Figure 7.3: Camera Uncertainty, Image Space, Euler Angles 106 Trial, σ Estimated Actual q0 0.0077 0.0075 qx 0.0101 0.0103 qy 0.0101 0.0105 qz 0.0077 0.0072 tx 5.7136 5.7620 ty 5.2124 5.2909 tz 6.6508 6.9735 Estimated Actual 0.0097 0.0098 0.0057 0.0050 0.0056 0.0052 0.0101 0.0099 4.7803 5.0047 6.3891 6.1433 6.2329 6.2856 Estimated Actual 0.0135 0.0109 0.0115 0.0083 0.0130 0.0115 0.0132 0.0106 7.7118 6.9733 8.6137 6.0114 9.8118 7.4261 Figure 7.4: Camera Uncertainty, Image Space, Quaternion Trial, σ Estimated Actual α 0.0475 0.0486 β 0.0217 0.0213 γ 0.0447 0.0459 tx 5.8582 5.7363 ty 5.2913 5.2787 tz 6.7576 6.6861 Estimated Actual 0.0222 0.0224 0.0181 0.0164 0.0169 0.0158 4.8155 4.9306 6.4386 5.9908 6.2976 6.3086 Estimated Actual 0.0403 0.0296 0.0224 0.0234 0.0385 0.0284 7.7852 7.1507 8.7470 6.2901 9.9881 7.0801 Figure 7.5: Camera Uncertainty, Object Space, Euler Angles Trial, σ Estimated Actual q0 0.0079 0.0079 qx 0.0103 0.0102 qy 0.0104 0.0101 qz 0.0079 0.0076 tx 5.8579 5.7363 ty 5.2916 5.2787 tz 6.7578 6.6861 Estimated Actual 0.0098 0.0096 0.0058 0.0050 0.0056 0.0053 0.0102 0.0099 4.8153 4.9306 6.4386 5.9908 6.2969 6.3086 Estimated Actual 0.0136 0.0114 0.0117 0.0084 0.0133 0.0114 0.0135 0.0111 7.7862 7.1507 8.7453 6.2901 9.9881 7.0801 Figure 7.6: Camera Uncertainty, Object Space, Quaternion 107 Figure 7.7: Overlay Test Platform 7.2 Overlay Test Platform Unlike other applications where absolute position and orientation error are important, the success of a pose estimation system within AR may be measured by overlay accuracy. In other words, it is only important to analyze how “good” the overlay looks to a user. Goodness is defined by the degree of proper alignment of virtual graphics with the real-world. For this reason, an overlay test platform, figure 7.7, was created to clearly indicated misalignment. The grid of red dots represents a virtual object which should be aligned with the grid of blue dots. In real-time, the screen of the video see-through display is recorded to file for off-line analysis. Using basic image processing techniques, the red and blue dots are automatically segmented within all video frames. Misalignment occurs when a red dot does not lie exactly in the middle of four blue dots. An error in pixels is related to an actual distance error by a simple scale factor. The results of figures 7.8 and 7.9 show that the alignment error has about 1.5” standard deviation. Additionally, the results indicate that mean of the error is nonzero and therefore a small bias or miscalibration must exist. 108 Trial # 1 2 3 µx 0.6719 0.5030 0.6635 µy -0.5526 -0.4178 -0.3707 σx 1.3154 1.4680 1.3901 σy 1.5251 1.2923 1.4203 Figure 7.8: Overlay Accuracy, without Gyros (Inches) Trial # 1 2 3 µx 0.3352 0.4105 0.4591 µy -0.3235 -0.2114 -0.3156 σx 0.8382 1.2336 1.2451 σy 1.0786 1.3096 1.1880 Figure 7.9: Overlay Accuracy, with Gyros (inches) 7.3 Human Testing - Realism Ultimately, for an augmented reality system to be successful users must perceive the integration of virtual graphics with the real-world to be as realistic as possible. In addition to correct registration, the fidelity of virtual objects themselves contributes to the realism. 
For example, a 3D model might appear “blocky” or coarse due to low polygon count, or its brightness and shadowing might not match the real environment. Furthermore, the AR display might have a low refresh rate, resolution, or luminosity. All of these factors, and more, may degrade the illusion: jitter, drift or misalignment, lag or delay, and an unnatural, two-dimensional appearance. For these reasons a realism questionnaire was completed by CSM students who used the AR system. It is partly based on the work of [32], which measures three different human-factors components. “Realness” relates to how real the virtual objects seemed and how well they integrated with the real objects. “Spatial presence” relates to the 3D-ness of the virtual objects. “Perceptual stress” taps experiences of the perceptual process itself: whether the difference between real and virtual drew attention, and whether the perception of the virtual object needed effort. The participants answered the realism questions with the gyros enabled and disabled, columns two and three of figure 7.10. All verbal feedback indicated the need for the IMU. Users stated that without the gyros they found it very distracting when the cones were not in view, because the pose of the virtual model stopped updating altogether.

                                                                      With Gyros   Without Gyros
I had the impression that virtual objects fit naturally
into the real world?                                                    89.4608        83.3103
I had the impression that I could have touched and
grasped the virtual object?                                             86.3899        82.1588
I had the impression that virtual objects were located in
3-dimensional space, not just appearing as flat images
on the screen?                                                          91.0677        89.1395
Small distractions did not draw my attention away from
the integration of virtual objects into the real world?                 87.1483        82.9315
I did not have to make an effort to recognize the virtual
objects as being three-dimensional?                                     96.4061        95.6133

Figure 7.10: Realism Questionnaire

Chapter 8 Summary and Conclusions

In summary, a new augmented reality system was constructed from low-cost off-the-shelf components. An 8.4” LCD demonstrates handheld video see-through technology. All software was written in standard C++ with free or open source libraries. Three different 3D models exhibit AR capabilities within four application areas: C130 Hercules - training, Kuka manipulator - education and teleoperation support, Matlab peaks plot - 3D data visualization. An original vision and inertial pose estimation system was designed and implemented. The system relies on input from two sensors; camera updates occur at 30 Hz and IMU updates at 100 Hz. A new calibration routine was created for relative sensor orientation using gravity vectors, extrinsic camera parameters, and an orthogonal matrix method. Traffic cones are used as 3D fiducials. The system is scalable because any size cone can be used, 4” to 36”. The use of 3D fiducials creates a much larger work volume than planar targets. The cones are also inexpensive, durable, stackable, and may be quickly placed for outdoor use. A quaternion-based EKF was implemented to filter sensor measurements and avoid singularities found in other rotation parameterizations. A graphical user interface allows filter parameters to be tuned during real-time operation. Additionally, sensors may be enabled and disabled. A new object space pose uncertainty algorithm with quaternion constraint was designed and verified with a Monte Carlo method.
This method is needed for systems that rely on a quaternion-based estimator and object space algorithms such as the Orthogonal 111 Iteration Algorithm. Several new computer vision algorithms were created and tested in Matlab. The new algorithms were compared against popular existing algorithms. Results show several linear techniques that outperform the DLT method. Gauss-Newton algorithms were shown to be much faster than the Orthogonal Iteration Algorithm but fail to converge occasionally. The 3-Simplex SDP shows promising results but further analysis in needed to conclusively state its convergence properties. A new 6-DOF overlay test platform was created for measuring and analyzing overlay error in the end-to-end system. Virtual and real features are segmented from a video stream to measure alignment error. Results clearly indicate low static and dynamic registration error. Students completed a “realism” questionnaire, with and without the gyros enabled. Responses indicated the need for the IMU at times when fiducials are not visible. The responses also helped indicated areas of improvement. It is inevitable that the user will look away from the fiducials. During these times a complementary sensor is needed so that registration error does not grow excessively large to the point where the user is negatively distracted. The next generation system should have a lower noise IMU so that accelerometers can be integrated into the EKF for updating translation. To reduce motion restrictions caused by data and power cables between the sensor/LCD head and workstation, all computing and power should be moved into a backpack and run on a laptop. The quality of video see-though display could be enhanced by undistorting the entire video stream instead of only feature points. Cone segmentation accuracy could likely be improved by using a more complex color classification method and implementing a predictive template matching scheme. Line detection might also help locate cones within an image. The 3-Simplex SDP algorithm should be studied more carefully to understand a few key steps including optimal selection of simplex vertices, optimal combination of distance and dot product constraints, and how to choose the number of eigenvectors. 112 Bibliography [1] F. Ababsa, M. Mallem, and D. Roussel. Comparision between particle filter approach and kalman filter-based technique for head tracking in augmented reality systems, 2004. [2] Y. Andel-Aziz and H. Karara. Direct linear transform from comparator coordinates into object space coordinates in close-range photogrammetry, 1978. [3] A. Ansar and K. Daniilidis. Linear pose estimation from points or lines, 2003. [4] R. Azuma. A survey of augmented reality, 1997. [5] R. Azuma and et al. Recent advances in augmented reality, 2001. [6] J. Chou. Quaternion kinematic and dynamic differential equations, 1992. [7] D. Choukroun, Y. Oshman, and I. Bar-Itzkack. Novel quaternion kalman filter, 2006. [8] J. Craig. Introduction to robotics: Mechanics and control, 2004. [9] E. Dam, M. Koch, and M. Lillholm. Quaternions, interpolation, and animation, 1998. [10] D. Drascic and P. Milgram. Perceptual issues in augmented reality, 1996. [11] D. Eberly. Rotation representations and performance issues, 2002. [12] P. Fiore. Efficient linear solution of exterior orientation, 2001. [13] M. Fischler and R. Bolles. Random sample and consensus: A paradigm for model fitting with applications to image analysis and automated cartography, 1981. [14] R. Hartley. 
Minimizing algebraic error in geometric estimation problems, 1998. [15] J. Heikkila and O. Silven. Calibration procedure for short focal length off-the-shelf ccd cameras, 1996. 113 [16] J. Heikkila and O. Silven. A four-step camera calibration procedure with implicit image correction, 1997. [17] R. Holloway. Registration error analysis for augmented reality, 1997. [18] R. Horaud, B. Conio, and O. Leboulleux. An analytic solution for the perspective 4-point problem, 1989. [19] B. Horn. Closed-form solution of absolute orientation using unit quaternions, 1986. [20] B. Horn, H. Hilden, and S. Negahdariour. Closed-form solution of absolute orientation using orthonormal matrices, 1988. [21] J. LaViola Jr. A comparision of the unscented and extended kalman filtering of estimation quaternion motion, 2003. [22] S. Julier and J. Uhlmann. A new extension of the kalman filter to nonlinear systems, 1997. [23] J. Lasserre. Global optimization with polynomials and the problem of moments, 2001. [24] P. Lee and J. Moore. Gauss-newton-on-manifold for pose estimation, 2005. [25] C. Lu, G. Hager, and E. Mjolsness. Fast and globally convergent pose estimation from video images, 2000. [26] J. Maida, C. Bowen, and J. Pace. Improving robotic operator performance using augmented reality, 2007. [27] F. Moreno-Noguer, P. Fua, and V. Lepetit. Accurate non-iterative o(n) solution to the pnp problem, 2007. [28] U. Neumann and A. Majoros. Cognitive, performance, and systems issues for augmented reality applications in manufacturing and maintenance, 1998. [29] K. Nguyen. Inertial data fusion using kalman filter methods for augmented reality, 1998. [30] E. Pervin and J. Webb. Quaternions in computer vision and robotics, 1982. [31] L. Quan and Z. Lan. Linear n-point camera pose determination, 1999. 114 [32] H. Regenbrecht and T. Schubert. Measuring presence in augmented reality environments: Design and a first test of a questionnaire, 2002. [33] K. Shoemake. Quaternions, 2007. [34] K. Stanney, R. Mourant, and R. Kennedy. Human factors issues in virtual environments: A review of the literature, 1998. [35] S. Thrun, W. Burgard, and D. Fox. Probabilistic robotics, 2001. [36] N. Trawny and S. Roumeliotis. Indirect kalman filter for 3d attitude estimation, 2005. [37] B. Triggs. Camera pose and calibration from 4 or 5 known 3d points, 1999. [38] R. Tsai. A versatile camera calibration technique for high-accuracy 3d machine vision metrology using off-the-shelf tv cameras and lenses, 1987. [39] A. Ude. Filtering in a unit quaternion space for model-based object tracking, 1999. [40] S. Umeyama. Least-squares estimation of transformation parameters between two point patterns, 1991. [41] T. Vincent and W. Hoff. Analysis of head pose accuracy in augmented reality, 2000. [42] W. Wolfe, D. Mathis, C. Sklair, and M. Magee. The perspective view of three points, 1991. [43] R. Zanetti and R. Bishop. Quaternion estimation and norm constrained kalman filtering, 2006. 
Chapter 9

Appendix

9.1 DLT Parameters

D = -(r_{31} X_0 + r_{32} Y_0 + r_{33} Z_0)   (9.1)

l_1 = \frac{u_0 r_{31} - f_u r_{11}}{D}, \quad l_2 = \frac{u_0 r_{32} - f_u r_{12}}{D}, \quad l_3 = \frac{u_0 r_{33} - f_u r_{13}}{D}   (9.2)

l_4 = \frac{(f_u r_{11} - u_0 r_{31}) X_0 + (f_u r_{12} - u_0 r_{32}) Y_0 + (f_u r_{13} - u_0 r_{33}) Z_0}{D}   (9.3)

l_5 = \frac{v_0 r_{31} - f_v r_{21}}{D}, \quad l_6 = \frac{v_0 r_{32} - f_v r_{22}}{D}, \quad l_7 = \frac{v_0 r_{33} - f_v r_{23}}{D}   (9.4)

l_8 = \frac{(f_v r_{21} - v_0 r_{31}) X_0 + (f_v r_{22} - v_0 r_{32}) Y_0 + (f_v r_{23} - v_0 r_{33}) Z_0}{D}   (9.5)

l_9 = \frac{r_{31}}{D}, \quad l_{10} = \frac{r_{32}}{D}, \quad l_{11} = \frac{r_{33}}{D}   (9.6)

l_1 X_0 + l_2 Y_0 + l_3 Z_0 = -l_4, \quad l_5 X_0 + l_6 Y_0 + l_7 Z_0 = -l_8, \quad l_9 X_0 + l_{10} Y_0 + l_{11} Z_0 = -1   (9.7)

Intrinsic Parameters

D^2 = \frac{1}{l_9^2 + l_{10}^2 + l_{11}^2}   (9.8)

u_0 = D^2 (l_1 l_9 + l_2 l_{10} + l_3 l_{11}), \quad v_0 = D^2 (l_5 l_9 + l_6 l_{10} + l_7 l_{11})   (9.9)

f_u^2 = D^2 \left[ (u_0 l_9 - l_1)^2 + (u_0 l_{10} - l_2)^2 + (u_0 l_{11} - l_3)^2 \right], \quad f_v^2 = D^2 \left[ (v_0 l_9 - l_5)^2 + (v_0 l_{10} - l_6)^2 + (v_0 l_{11} - l_7)^2 \right]   (9.10)
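To verify these relations numerically, the following Octave/Matlab sketch builds the eleven DLT parameters from an arbitrary rotation, point (X_0, Y_0, Z_0), and set of intrinsics using Eqs. (9.1)-(9.6), and then recovers the intrinsics from those parameters using Eqs. (9.8)-(9.10). All numeric inputs are arbitrary test values, not quantities taken from the experiments in this thesis.

% Round-trip check of the DLT parameter and intrinsic-recovery relations.
a  = [0.3 -0.2 0.1];                        % arbitrary rotation angles
Rx = [1 0 0; 0 cos(a(1)) -sin(a(1)); 0 sin(a(1)) cos(a(1))];
Ry = [cos(a(2)) 0 sin(a(2)); 0 1 0; -sin(a(2)) 0 cos(a(2))];
Rz = [cos(a(3)) -sin(a(3)) 0; sin(a(3)) cos(a(3)) 0; 0 0 1];
R  = Rz*Ry*Rx;                              % orthonormal rotation matrix
P0 = [0.5; -0.2; -2.0];                     % (X0, Y0, Z0), arbitrary test values
fu = 820; fv = 810; u0 = 322; v0 = 241;     % intrinsics, arbitrary test values

D  = -R(3,:)*P0;                            % Eq. (9.1)
l  = zeros(11,1);
l(1:3)  = (u0*R(3,:) - fu*R(1,:)) / D;      % Eq. (9.2)
l(4)    = (fu*R(1,:) - u0*R(3,:))*P0 / D;   % Eq. (9.3)
l(5:7)  = (v0*R(3,:) - fv*R(2,:)) / D;      % Eq. (9.4)
l(8)    = (fv*R(2,:) - v0*R(3,:))*P0 / D;   % Eq. (9.5)
l(9:11) = R(3,:) / D;                       % Eq. (9.6)

D2  = 1 / (l(9)^2 + l(10)^2 + l(11)^2);     % Eq. (9.8)
u0r = D2 * (l(1:3)' * l(9:11));             % Eq. (9.9)
v0r = D2 * (l(5:7)' * l(9:11));
fur = sqrt(D2 * sum((u0r*l(9:11) - l(1:3)).^2));   % Eq. (9.10)
fvr = sqrt(D2 * sum((v0r*l(9:11) - l(5:7)).^2));
disp([u0 u0r; v0 v0r; fu fur; fv fvr])      % each true/recovered pair agrees

The recovered values match the true intrinsics to machine precision because the rows of R are orthonormal, which is exactly the property that Eqs. (9.8)-(9.10) exploit.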
9.2 Simplex SDP

9.2.1 Three Variable/Vector Case

Distance Constraint

C_d^l = \left\| (\beta_a v_{a_i} + \beta_b v_{b_i} + \beta_c v_{c_i}) - (\beta_a v_{a_j} + \beta_b v_{b_j} + \beta_c v_{c_j}) \right\|^2   (9.11)

Dot Product Constraint

C_d^l = \left[ (\beta_a v_{a_i} + \beta_b v_{b_i} + \beta_c v_{c_i}) - (\beta_a v_{a_k} + \beta_b v_{b_k} + \beta_c v_{c_k}) \right] \cdot \left[ (\beta_a v_{a_j} + \beta_b v_{b_j} + \beta_c v_{c_j}) - (\beta_a v_{a_k} + \beta_b v_{b_k} + \beta_c v_{c_k}) \right]   (9.12)

Objective Function

E_l = \left( C_{200}^l \beta_a^2 + C_{020}^l \beta_b^2 + C_{002}^l \beta_c^2 + C_{110}^l \beta_a \beta_b + C_{101}^l \beta_a \beta_c + C_{011}^l \beta_b \beta_c - d_l^M \right)^2   (9.13)

LMI Constraints

X_1 = [\beta_a \;\; \beta_b \;\; \beta_c]^T   (9.14)

X_2 = [1 \;\; \beta_a^2 \;\; \beta_b^2 \;\; \beta_c^2 \;\; \beta_a\beta_b \;\; \beta_a\beta_c \;\; \beta_b\beta_c]^T   (9.15)

M_0 = \begin{bmatrix} \beta_a^2 - \beta_b^2 & 0 \\ 0 & \beta_a^2 - \beta_c^2 \end{bmatrix}, \quad M_1 = X_1 X_1^T = \begin{bmatrix} \beta_a^2 & \beta_a\beta_b & \beta_a\beta_c \\ \beta_a\beta_b & \beta_b^2 & \beta_b\beta_c \\ \beta_a\beta_c & \beta_b\beta_c & \beta_c^2 \end{bmatrix}, \quad M_2 = X_2 X_2^T   (9.16)

9.2.2 Four Variable/Vector Case

Distance Constraint

C_d^l = \left\| (\beta_a v_{a_i} + \beta_b v_{b_i} + \beta_c v_{c_i} + \beta_d v_{d_i}) - (\beta_a v_{a_j} + \beta_b v_{b_j} + \beta_c v_{c_j} + \beta_d v_{d_j}) \right\|^2   (9.17)

Dot Product Constraint

C_d^l = \left[ (\beta_a v_{a_i} + \beta_b v_{b_i} + \beta_c v_{c_i} + \beta_d v_{d_i}) - (\beta_a v_{a_k} + \beta_b v_{b_k} + \beta_c v_{c_k} + \beta_d v_{d_k}) \right] \cdot \left[ (\beta_a v_{a_j} + \beta_b v_{b_j} + \beta_c v_{c_j} + \beta_d v_{d_j}) - (\beta_a v_{a_k} + \beta_b v_{b_k} + \beta_c v_{c_k} + \beta_d v_{d_k}) \right]   (9.18)

Objective Function

E_l = \left( C_{2000}^l \beta_a^2 + C_{0200}^l \beta_b^2 + C_{0020}^l \beta_c^2 + C_{0002}^l \beta_d^2 + C_{1100}^l \beta_a\beta_b + C_{1010}^l \beta_a\beta_c + C_{1001}^l \beta_a\beta_d + C_{0110}^l \beta_b\beta_c + C_{0101}^l \beta_b\beta_d + C_{0011}^l \beta_c\beta_d - d_l^M \right)^2   (9.19)

LMI Constraints. Matrix M_2 = X_2 X_2^T is too large to fit on the page!

X_1 = [\beta_a \;\; \beta_b \;\; \beta_c \;\; \beta_d]^T   (9.20)

X_2 = [1 \;\; \beta_a^2 \;\; \beta_b^2 \;\; \beta_c^2 \;\; \beta_d^2 \;\; \beta_a\beta_b \;\; \beta_a\beta_c \;\; \beta_a\beta_d \;\; \beta_b\beta_c \;\; \beta_b\beta_d \;\; \beta_c\beta_d]^T   (9.21)

M_0 = \begin{bmatrix} \beta_a^2 - \beta_b^2 & 0 & 0 \\ 0 & \beta_a^2 - \beta_c^2 & 0 \\ 0 & 0 & \beta_a^2 - \beta_d^2 \end{bmatrix}, \quad M_1 = X_1 X_1^T = \begin{bmatrix} \beta_a^2 & \beta_a\beta_b & \beta_a\beta_c & \beta_a\beta_d \\ \beta_a\beta_b & \beta_b^2 & \beta_b\beta_c & \beta_b\beta_d \\ \beta_a\beta_c & \beta_b\beta_c & \beta_c^2 & \beta_c\beta_d \\ \beta_a\beta_d & \beta_b\beta_d & \beta_c\beta_d & \beta_d^2 \end{bmatrix}   (9.22)

9.3 Chi-Square Distribution

\chi_k^2 = \sum_{i=1}^{k} \frac{(\hat{x} - x_i)^2}{\sigma_x^2}   (9.23)

The PDF and CDF of the chi-square distribution are

f_\chi(x, k) = \begin{cases} \dfrac{x^{k/2-1} e^{-x/2}}{2^{k/2}\,\Gamma(k/2)}, & x > 0 \\ 0, & x \le 0 \end{cases}   (9.24)

F_\chi(x, k) = \frac{\gamma(k/2, x/2)}{\Gamma(k/2)}   (9.25)

The gamma, upper incomplete gamma, and lower incomplete gamma functions are

\Gamma(a) = \int_0^\infty t^{a-1} e^{-t}\, dt   (9.26)

\Gamma(a, x) = \int_x^\infty t^{a-1} e^{-t}\, dt   (9.27)

\gamma(a, x) = \int_0^x t^{a-1} e^{-t}\, dt   (9.28)

A short numerical check of these expressions is given at the end of this appendix.

9.4 Computer Vision Algorithm Comparison

Figure 9.1: Group 1, 10 Points, Top: σ_uv = 5, Bottom: σ_uv = 10
Figure 9.2: Group 1, 25 Points, Top: σ_uv = 5, Bottom: σ_uv = 10
Figure 9.3: Group 2, 10 Points, Top: σ_uv = 5, Bottom: σ_uv = 10
Figure 9.4: Group 2, 25 Points, Top: σ_uv = 5, Bottom: σ_uv = 10
Figure 9.5: Group 3, 10 Points, Top: σ_uv = 5, Bottom: σ_uv = 10
Figure 9.6: Group 3, 25 Points, Top: σ_uv = 5, Bottom: σ_uv = 10
Figure 9.7: Group 4, 10 Points, Top: σ_uv = 5, Bottom: σ_uv = 10
Figure 9.8: Group 4, 25 Points, Top: σ_uv = 5, Bottom: σ_uv = 10
Figure 9.9: Group 5, 10 Points, Top: σ_uv = 25, Middle: σ_uv = 50, Bottom: σ_uv = 75
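Returning to Section 9.3, the following Octave/Matlab sketch evaluates the chi-square PDF and CDF of Eqs. (9.24) and (9.25) with the built-in regularized lower incomplete gamma function, and confirms that numerically integrating the PDF reproduces the CDF. The degrees-of-freedom value is an arbitrary example.

% Numerical check of Eqs. (9.24)-(9.25). In Matlab/Octave, gammainc(x, a)
% returns the regularized lower incomplete gamma function γ(a, x)/Γ(a).
k = 6;                                      % degrees of freedom (example value)
x = linspace(0, 25, 2000);

fx = x.^(k/2 - 1) .* exp(-x/2) ./ (2^(k/2) * gamma(k/2));   % PDF, Eq. (9.24)
Fx = gammainc(x/2, k/2);                                    % CDF, Eq. (9.25)

Fnum = cumtrapz(x, fx);                     % integrate the PDF numerically
max(abs(Fx - Fnum))                         % small, limited by the trapezoid rule

plot(x, fx, x, Fx); legend('pdf', 'cdf');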