On Environment Modeling for Visual Navigation
Zhigang Zhu
Ph.D. Thesis, Department of Computer Science and Technology, Tsinghua University, Beijing (1997)

Table of Contents

Abstract
Acknowledgments

1 Problems and Solutions
1.1 Problems
1.2 Background
1.2.1 The Marr Paradigm
1.2.2 The DARPA Visual Navigation Efforts
1.2.3 Obstacle Detection Techniques
1.2.4 Omnidirectional Vision and Panoramic Vision
1.3 Our Approach: Multi-Scale and Full-View Vision
1.3.1 Overview
1.3.2 Contributions
1.3.3 Organization

2 Panoramic Vision for Landmark Recognition
2.1 Introduction
2.2 Motion Filtering and Image Stabilization
2.2.1 Vehicular Motion Model (Appendix 2.1)
2.2.2 Motion Estimation and Image Rectification (Appendix 2.2)
2.2.3 Motion Filtering Algorithms
2.2.4 PVI and EPI Generation: Experimental Results
2.3 Panoramic EPI Analysis Approach
2.3.1 Motion Texture and Motion Occlusion Models (Appendix 2.3)
2.3.2 GFOD: Large Gaussian Fourier Orientation Detector (Appendix 2.4)
2.3.3 Depth Belief Map and Data Selection
2.3.4 Motion Boundary Localization and Depth Interpolation
2.4 Panoramic Modeling and Generalized Landmark Selection
2.4.1 Image Rectification and Stabilization
2.4.2 Panoramic Depth Acquisition: Parallel Processing
2.4.3 Fusion of Depth and Intensity Maps
2.4.4 Generalized Landmark Selection
2.5 Summary and Discussions (Philosophy, Advantages and Limitations)

3 Omnidirectional Vision for Road Understanding
3.1 Introduction
3.2 Omnidirectional Vision-Based Eigenspace Representation
3.2.1 Problem Statement
3.2.2 Eigenspace Representation
3.2.3 Omnidirectional Eigenspace Representation
3.3 Real-Time Omnidirectional Imaging Sensor
3.3.1 Sensor Geometry (Appendix 3.1)
3.3.2 Practical Imaging System and System Calibration
3.3.4 Ground Projection and Image Rectification
3.3.5 Ground Feature Analysis in the Polar Coordinate System (Appendix 3.2)
3.4 Rotation-Invariant Feature Space of Omnidirectional Images
3.4.1 Radial Principal Component Analysis (PCA)
3.4.2 Orientation Fourier Transform
3.4.3 Orientation Estimation
3.5 ROVINN: Omnidirectional Vision-Based Neural Networks
3.5.1 Problem Statement
3.5.2 The Model of the ROVINN
3.5.3 Implementation Issues
3.6 Experimental Results and Analysis
3.6.1 System Architecture
3.6.2 Data Collection
3.6.3 Training and Testing
3.7 Summary and Discussions

4 Stereo Vision for Obstacle Detection
4.1 Introduction
4.2 Principle of Planar Gaze Transformation
4.2.1 Theory (Appendix 4.1)
4.2.2 Properties (Appendix 4.2)
4.2.3 Applications
4.3 Binocular Vision System Using a Single Camera
4.3.1 Design I: the Left-Right Partition (How to Make It Compact)
4.3.2 Design II: the Up-Down Partition (How to Have a Wide FOV)
4.4 Real-Time Obstacle Detection Algorithm
4.4.1 Goal and Assumptions
4.4.2 Statistical Modeling (Appendix 4.3)
4.4.3 The Basic Algorithm
4.4.4 Implementation and Performance (Appendix 4.4)
4.5 Dynamic Gaze Transformation
4.5.1 Iterative Approach
4.5.2 Image Stabilization Method
4.5.3 Generalized Hough Transform Approach
4.6 Summary and Discussion

5 POST: a Multi-Scale and Full-View Vision Approach
5.1 Scene Modeling and Interconnection: a Systems Approach
5.2 Sensing Integration: a Compact Full-View Visual Sensor
5.3 Data Integration and Interconnection among Sub-Systems
5.4 Human-Robot Interaction and Vision Enhancements
5.5 Conclusions and Discussions
6 Conclusions and Future Directions

Appendices
A-2.1 A Generalized Vehicular Motion Model
  1 Motion with Changing Speed
  2 Motion Along a Curved Path
A-2.2 Image Matching and Motion Estimation
A-2.3 Proof of Motion Occlusion Model
A-2.4 GFOD Fast Algorithm
A-3.1 Geometry of Omnidirectional Imaging
A-3.2 Cylindrical Projection
A-4.1 Proof of Gaze Transformation Geometry
A-4.2 Proof of the Reprojected Disparity

Bibliography

Extended Summary

Panoramic/omnidirectional representations of image sequences have a wide application scope, including robot navigation, virtual reality, interactive 2D/3D video, content-based video compression, and full-view video surveillance. Scene modeling using image mosaicing and panoramic/omnidirectional vision has attracted great attention in the fields of computer vision and computer graphics in recent years. Usually, researchers focus either on the analysis/recognition part (e.g., panoramic/omnidirectional vision for robot navigation) or on the synthesis/visualization part (e.g., image mosaicing, panoramic and layered representations in image-based rendering or virtualized reality). This thesis makes a first attempt to systematically bring the two seemingly quite different topics under a single umbrella of "visual modeling and presentation".

Fig. 1. Interaction diagram linking a human being and a robot to the real world and a virtual world through visual scene modeling. (HCI: Human-Computer Interaction; VR: Virtual/Virtualized Reality; AI: Artificial Intelligence / Visual Navigation; AR: Augmented Reality)

First, let us look at the two topics, robot navigation and virtualized reality, from the broader perspective of the interaction between a "being" and an "environment" (Fig. 1). We can find a very close resemblance between them: robot navigation is the interaction between a robot (i.e., a digital being) and the real 3D world, while virtual/virtualized reality is the interaction between a person (i.e., a human being) and a virtual/virtualized environment. If we limit our discussion of the "interaction" to visual perception, the central problem that needs to be solved for these two kinds of interactions is visual scene modeling and representation in a computer, either inside the "mind" of a robot or outside the mind of a human being (Fig. 1).

Second, a closer examination of the research efforts of the past ten years shows that the techniques and representations used for the two applications are surprisingly similar. Graphics researchers talk about multiperspective projection for image-based rendering of large-scale scenes, while vision researchers use the concept of spatio-temporal panoramic view images for robot localization and landmark recognition. Vision/robotics researchers take advantage of the 360-degree view angle of omnidirectional images for map building, road following and obstacle detection in robot navigation, whereas graphics researchers generate omnidirectional image representations for image-based rendering. My own research also shows that the same basic methodology can be used to build, and the same structures to represent, visual scene models for both robot navigation and image-based rendering.

Finally, we can find a class of interesting applications that integrate these two kinds of models: human-robot intelligent navigation (HRIN) systems, such as a semi-autonomous mobile robot for mail delivery, military surveillance and intelligent transportation.
In a HRIN system, the robot will automatically carry out most of the basic tasks such as road following, obstacle detection and target localization, while a human supervisor will make important decisions or deal with emergency situations via augmented reality and tele-operation. Thus a unified model that includes both the symbolic environment model for navigation and the photorealistic scene model for visualization is required.

Needless to say, visual navigation of a mobile robot in a natural environment has always been a very interesting but challenging problem. It involves almost every aspect of computer vision research, from visual sensors through robust algorithms to visual representations. The basic requirements of visual navigation include global localization (to decide where to go), road following (to stay on the road) and obstacle detection (to avoid collision). Only after these safety requirements have been satisfied, which has proven to be a non-trivial problem, can the robot pursue other task-oriented goals. It is clear that visual environment modeling is the foundation of these basic issues in visual navigation, and it extends to most of the real-world problems in computer vision. This work presents a systematic approach to visual modeling of a natural scene for robot navigation:

Fig. 2. Full-view vision for robot navigation: spatio-temporal panoramic views of the road-side scenes on the left and right, an omnidirectional view of the road surface, and binocular left-front and right-front views.

1. A purposive, multi-scale and full-view visual scene modeling approach is proposed for visual navigation in a natural environment (Chapter 1 - Chapter 5). As a typical instance, an integrated system called POST is proposed, which combines three novel modules (Fig. 2): Panoramic vision for landmark recognition, Omnidirectional vision for road understanding, and STereo vision for obstacle detection. This approach tries to overcome the drawbacks of traditional visual navigation methods, which have mostly depended on local and/or single-view visual information. However, the proposed approach is not just a simple combination of the three novel sensors and methods, but rather a systematic integration under the strategy of purposive vision ("the right way for the right work") and under the philosophy of a systems approach, which emphasizes that "the whole is more than the sum of its parts". Thus, correct sensor design, adequate levels of scene representation, and corresponding robust and fast algorithms are specifically explored for each given task, while the interconnections among the vision sub-systems are taken into consideration under the overall goal of autonomous navigation. Human-robot cooperation in different navigation modes (autonomous, semi-autonomous and tele-operational) and different levels of vision enhancement (video enhancement, stereo enhancement, view enhancement, information enhancement and virtualized reality) will be discussed.

Fig. 3. 3D panoramic representation for landmark selection: (1) panoramic texture map, (2) panoramic depth map, (3) parallel projection of the 3D panorama. Labeled landmarks include a horizontal wedge and a row of flags, a building facade and steps, pine trees and bamboo, depth changes in a wall, a building, a bridge and a pedestrian.

2. A two-stage method is presented for 3D panoramic scene modeling for landmark selection (Chapter 2). As inputs, image sequences are captured by a video camera subject to small but unpredictable fluctuations on a common road surface. First, a 3D image stabilization method is proposed which eliminates the fluctuations from the vehicle's smooth motion so that "seamless" panoramic view images (PVIs) and epipolar plane images (EPIs) can be generated. Second, an efficient panoramic EPI analysis method is proposed to combine the advantages of both PVIs and EPIs in two important steps: frequency-domain locus orientation detection, and spatio-temporal-domain motion boundary localization. The two-stage method not only combines Zheng-Tsuji's PVI method with Bolles-Baker's EPI analysis, resulting in the so-called panoramic EPI method, but also generalizes them to handle image sequences subject to small but unpredictable camera fluctuations. Since camera calibration, image segmentation, feature extraction and matching are completely avoided, all the proposed algorithms are fully automatic and rather general. Finally, a compact representation in the form of a 3D panorama of a large-scale scene is constructed that can be used effectively for generalized landmark selection in robot navigation (Fig. 3). This method will further be applied in image-based rendering.
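To make these spatio-temporal representations concrete, the following minimal sketch shows how a PVI and an EPI can be assembled from an already-stabilized grayscale sequence by stacking a fixed image column and a fixed image row over time. It is only an illustration under simplifying assumptions: the function name, the choice of the central column and row as sampling slits, and the NumPy conventions are mine, and the thesis method additionally performs motion filtering and 3D image stabilization before this step.

```python
import numpy as np

def build_pvi_and_epi(frames, column=None, row=None):
    """Assemble a panoramic view image (PVI) and an epipolar plane image (EPI).

    frames : sequence of H x W grayscale frames from a sideways-translating,
             already-stabilized camera.
    Returns (pvi, epi): pvi has shape H x T (one column per frame),
    epi has shape T x W (one scanline per frame).
    """
    frames = [np.asarray(f, dtype=np.float32) for f in frames]
    h, w = frames[0].shape
    col = w // 2 if column is None else column   # vertical sampling slit
    r = h // 2 if row is None else row           # horizontal scanline

    # PVI: stacking the slit over time sweeps out a panorama of the road side.
    pvi = np.stack([f[:, col] for f in frames], axis=1)

    # EPI: the same scanline over time; each scene point traces a locus whose
    # slope depends on its depth, which the orientation detector exploits.
    epi = np.stack([f[r, :] for f in frames], axis=0)
    return pvi, epi
```

In the thesis, the locus orientations in such EPIs are then estimated in the frequency domain and motion boundaries are localized in the spatio-temporal domain, yielding the panoramic depth map of Fig. 3.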
Fig. 4. ROVINN architecture and its interconnection with the other two modules: the omni-view sensor output is polar-transformed and reduced by PCA and a DFT in the IPM, classified by the RCN into a road class c, passed to the corresponding RON (R1 ... RC) for road orientation estimation, and fused in the DFM with panoramic information (robot location) and binocular information (obstacles) to produce a steering output. (RCN: Road Classification Network; RON: Road Orientation Network; DFM: Data Fusion Module; IPM: Image Processing Module; DFT: Discrete Fourier Transform; PCA: Principal Component Analysis)

3. A new road following approach, the Road Omni-View Image Neural Networks (ROVINN), has been proposed (Chapter 3). It combines omnidirectional image sensing with neural networks in such a manner that the robot can learn recognition and steering knowledge from omnidirectional road images, which in turn guarantees that the robot will never miss the road. The ROVINN approach brings Yagi's COPIS (conic omnidirectional projection image sensor) method to outdoor road scenes and provides an alternative solution to CMU's ALVINN system. Compact and rotation-invariant image features are extracted by integrating an omnidirectional eigenspace representation with frequency analysis, using principal component analysis (PCA) and the discrete Fourier transform (DFT). The modular neural networks of the ROVINN estimate road orientations more robustly and efficiently by classifying the road type as a first step, which enables the robot to adapt to various road types automatically.
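As a rough illustration of the rotation-invariance idea behind these features (not the ROVINN implementation itself), the sketch below compresses each radial line of a polar omnidirectional image with a precomputed PCA basis and then keeps only the magnitudes of the angular Fourier spectrum; the array shapes, the `radial_basis` argument and the number of retained harmonics are illustrative assumptions.

```python
import numpy as np

def rotation_invariant_features(polar_img, radial_basis, n_harmonics=8):
    """Compact, rotation-invariant features from a polar omnidirectional image.

    polar_img    : A x R array sampled over (angle, radius); a rotation of the
                   robot is a circular shift along the angle axis.
    radial_basis : R x K matrix of radial principal components learned offline.
    Returns a feature vector of length n_harmonics * K.
    """
    # Radial PCA: compress each of the A radial lines to K coefficients.
    coeffs = polar_img @ radial_basis                # A x K

    # Angular DFT: a circular shift along the angle axis only changes phases,
    # so the spectrum magnitudes are invariant to the robot's heading.
    spectrum = np.fft.rfft(coeffs, axis=0)           # (A//2 + 1) x K
    return np.abs(spectrum[:n_harmonics]).ravel()
```

Conversely, the phases discarded here are exactly where the heading information lives, which is presumably what the separate orientation estimation step (Sections 3.4.2-3.4.3) exploits.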
Fig. 5. Image gaze transformation and obstacle detection. Top: the left and right views in a single camera image; bottom-left: the left image rectified by the gaze transformation; bottom-right: the obstacle regions (trees and a person) after zero-disparity gaze control. The difference image shows that the ground images have been registered.

4. A novel method called the Image Gaze Transformation is presented for stereo-vision-based road obstacle detection (Chapter 4). Obstacle detection is modeled as a reflexive behavior of detecting anything that differs from the planar road surface. Dynamic gaze transformation algorithms are developed so that the method also works on a rough road surface. The novelty of the (dynamic) gaze transformation method, which resembles gaze control in human vision, lies in the fact that it brings the road surface to zero disparity, so that the feature extraction and matching procedures of traditional stereo vision are completely avoided in the proposed obstacle detection algorithms. The progressive processing strategy, from yes/no verification, through focus of attention, to 3D measurement based on the reprojection transformation, makes the hierarchical obstacle detection techniques efficient, fast and robust.
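The core of this zero-disparity test can be sketched in a few lines: if the ground plane's homography between the two views is known, warping one view with it registers every road pixel, and whatever refuses to register sticks up from the road. The sketch below assumes a precomputed homography `H_ground`, grayscale 8-bit images and an arbitrary threshold; the dynamic re-estimation of the homography on rough roads (Section 4.5) is not shown.

```python
import cv2

def detect_obstacles(left_img, right_img, H_ground, diff_thresh=30):
    """Flag pixels that violate the planar-road assumption.

    H_ground : 3x3 homography mapping ground points in the left image to their
               positions in the right image (assumed known from calibration).
    Returns a binary mask: registered road pixels cancel out, obstacle pixels remain.
    """
    h, w = right_img.shape[:2]

    # Bring the road surface to zero disparity: after this warp, ground pixels
    # in the two views coincide.
    left_warped = cv2.warpPerspective(left_img, H_ground, (w, h))

    # Points off the road plane violate the homography and leave a residual;
    # threshold it to obtain candidate obstacle regions.
    residual = cv2.absdiff(left_warped, right_img)
    _, mask = cv2.threshold(residual, diff_thresh, 255, cv2.THRESH_BINARY)
    return mask
```

On a rough road a fixed homography drifts; Section 4.5 therefore re-estimates it on-line (iteratively, by image stabilization, or with a generalized Hough transform) before the differencing step.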
To validate the proposed strategies and methods, the following algorithms and systems have been implemented.

(1) Design of novel sensors. An omni-view image sensor has been designed and built (Fig. 6), and its properties for outdoor road understanding have been thoroughly studied. A patented single-camera binocular vision system with a full horizontal field of view has also been designed and constructed, in which the left and right views are projected onto the top and bottom halves of a single image (Fig. 7); it has been put into real road use for obstacle detection. An inexpensive, integrated full-view smart sensor, POST (Panoramic, Omnidirectional and STereo vision sensor), is proposed, which integrates a 360-degree omnidirectional view with a binocular forward view as well as left and right side views, using a single camera and a set of reflecting mirrors (Fig. 8).

(2) Real-scene experiments. Experimental results of training and testing the ROVINN with real road images have shown that the proposed road following method is quite promising. A real-time visual obstacle detection system has been set up and extensively tested on outdoor road scenes.

(3) 3D scene modeling system. In the 3D panoramic scene modeling system (Fig. 9), algorithms for motion filtering and image stabilization, kinetic occlusion detection and depth layering have been developed, and 3D layered panoramic models have been constructed for many image sequences. These efforts form the basic framework both for global localization using generalized landmark selection and for the synthesis of photo-realistic image-based renderings.

Fig. 6. Omni-view image (OVI) sensor: (1) a prototype of the OVI sensor; (2) an omnidirectional image.

Fig. 7. Single-camera stereo vision system: (1) system geometry (top view), with mirrors L1, R1, L2 and R2 and the real camera; (2) a real binocular image pair. The left and right views are projected onto the top and bottom halves of a single image.

Fig. 8. POST: an integrated full-view vision sensor. (1) Sensor geometry: L1, R1, L2, R2, MR, ML and P are planar mirrors and C is a conic mirror; PR and PL are the virtual left and right side-view "cameras", SL and SR are the two virtual binocular front-view "cameras", and O is the virtual omni-view "camera" looking at the conic mirror; the real camera (shown as the sensor target) points perpendicularly into the page. (2) A composite 640x480 image containing a spatio-temporal image (PVI), two 256x128 binocular front views, a 60-pixel by 360-degree omnidirectional view, and 80x480 left and right side views.

Fig. 9. System diagram of 3D panoramic scene modeling: video is stabilized, PVIs and EPIs (one per image row, H in total) are generated, and EPI orientation and motion boundary analysis yields a panoramic depth map; fusing it with the panoramic texture map supports landmark selection (the Panoramic Landmark Model used in visual navigation), while occlusion recovery and depth layering produce relief-like and image-based LAMPs for image-based rendering. (PVI: Panoramic View Image; EPI: Epipolar Plane Image; LAMP: Layered, Adaptive-resolution and Multi-perspective Panorama)
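Finally, the step in the Fig. 9 pipeline that turns detected EPI locus orientations into the panoramic depth map reduces, under the simplifying assumption of constant lateral camera motion, to a one-line relation. The sketch below states it with illustrative parameter names; the thesis handles the more general case of fluctuating motion together with depth belief maps and occlusion boundaries.

```python
def depth_from_epi_slope(slope_px_per_frame, focal_px, speed_per_frame):
    """Depth of a scene point from the slope of its EPI locus.

    For a camera translating sideways at speed_per_frame (scene units per frame)
    with focal length focal_px (pixels), a point at depth Z crosses the image at
    dx/dt = focal_px * speed_per_frame / Z pixels per frame, so its depth is
    Z = focal_px * speed_per_frame / (dx/dt): nearby points trace fast-moving
    loci, distant points trace slow, nearly time-parallel ones.
    """
    return focal_px * speed_per_frame / slope_px_per_frame
```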