On Environment Modeling for Visual Navigation

Zhigang Zhu
Ph.D. Thesis
at
Department of Computer Science and Technology
Tsinghua University, Beijing
(1997)
Table of Contents
Abstract
Acknowledgments
1 Problems and Solutions
1.1 Problems
1.2 Background
1.2.1 The Marr Paradigm
1.2.2 The DARPA Visual Navigation Efforts
1.2.3 Obstacle Detection Techniques
1.2.4 Omnidirectional Vision and Panoramic Vision
1.3 Our Approach: Multi-Scale and Full View Vision
1.3.1 Overview
1.3.2 Contributions
1.3.3 Organization
2 Panoramic Vision for Landmark Recognition
2.1 Introduction
2.2 Motion Filtering and Image Stabilization
2.2.1 Vehicular Motion Model (Appendix 2.1)
2.2.2 Motion Estimation and Image Rectification (Appendix 2.2)
2.2.3 Motion Filtering Algorithms
2.2.4 PVI and EPI generation: Experimental Results
2.3 Panoramic EPI Analysis Approach
2.3.1 Motion Texture and Motion Occlusion Models (Appendix 2.3)
2.3.2 GFOD: Large Gaussian Fourier Orientation Detector (Appendix 2.4)
2.3.3 Depth Belief Map and Data Selection
2.3.4 Motion Boundary Localization and Depth Interpolation
2.4 Panoramic Modeling and Generalized Landmark Selection
2.4.1 Image Rectification and Stabilization
2.4.2 Panoramic Depth Acquisition: Parallel Processing
2.4.3 Fusion of Depth and Intensity Maps
2.4.4 Generalized Landmark Selection
2.5 Summary and Discussions (Philosophy, Advantages and Limitations)
3 Omnidirectional Vision for Road Understanding
3.1 Introduction
3.2 Omnidirectional Vision-Based Eigenspace Representation
3.2.1 Problem Statement
3.2.2 Eigenspace Representation
3.2.3 Omnidirectional Eigenspace Representation
3.3 Real-Time Omnidirectional Imaging Sensor
3.3.1 Sensor Geometry (Appendix 3.1)
3.3.2 Practical Imaging System and System Calibration
3.3.4 Ground Projection and Image Rectification
3.3.5 Ground Feature Analysis in Polar Coordinate System (Appendix 3.2)
3.4 Rotational-Invariant Feature Space of Omnidirectional Images
3.4.1 Radial Principal Component Analysis (PCA)
3.4.2 Orientation Fourier Transform
3.4.3 Orientation Estimation
3.5 ROVINN: Omnidirectional Vision-Based Neural Networks
3.5.1 Problem Statement
3.5.2 The Model of the ROVINN
3.5.3 Implementation Issues
3.6 Experimental Results and Analysis
3.6.1 System Architecture
3.6.2 Data Collection
3.6.3 Training and Testing
3.7 Summary and Discussions
4 Stereo Vision for Obstacle Detection
4.1 Introduction
4.2 Principle of Planar Gaze Transformation
4.2.1 Theory (Appendix 4.1)
4.2.2 Properties (Appendix 4.2)
4.2.3 Applications
4.3 Binocular Vision System Using a Single Camera
4.3.1 Design I: the Left-Right Partition (How to make it compact)
4.3.2 Design II: the Up-Down Partition (How to have a wide FOV)
4.4 Real-Time Obstacle Detection Algorithm
4.4.1 Goal and Assumptions
4.4.2 Statistical Modeling (Appendix 4.3)
4.4.3 The Basic Algorithm
4.4.4 Implementation and Performance (Appendix 4.4)
4.5 Dynamic Gaze Transformation
4.5.1 Iterative Approach
4.5.2 Image Stabilization Method
4.5.3 Generalized Hough Transform Approach
4.6 Summary and Discussion
5 POST: a Multi-Scale and Full-View Vision Approach
5.1 Scene Modeling and Interconnection: a Systems Approach
5.2 Sensing Integration: a Compact Full-View Visual Sensor
5.3 Data Integration and Interconnection among Sub-Systems
5.4 Human-Robot Interaction and Vision Enhancements
5.5 Conclusions and Discussions
6 Conclusions and Future Directions
Appendices
A-2.1 A Generalized Vehicular Motion Model
1 Motion with Changing Speed
2 Motion Along a Curved Path
A-2.2 Image Matching and Motion Estimation
A-2.3 Proof of Motion Occlusion Model
A-2.4 GFOD Fast Algorithm
A-3.1 Geometry of Omnidirectional Imaging
A-3.2 Cylindrical Projection
A-4.1 Proof of Gaze Transformation Geometry
A-4.2 Proof of the Reprojected Disparity
Bibliography
Extended Summary
Panoramic/omnidirectional representations of image sequences have a wide application scope, including robot navigation, virtual reality, interactive 2D/3D video, content-based video compression, and full-view video surveillance. Scene modeling using image mosaicing and panoramic/omnidirectional vision has attracted great attention in the fields of computer vision and computer graphics in recent years. Usually, researchers focus either on the analysis/recognition part (e.g., panoramic/omnidirectional vision for robot navigation) or on the synthesis/visualization part (e.g., image mosaicing, panoramic and layered representation in image-based rendering or virtualized reality). This thesis makes a first attempt to systematically bring these two seemingly quite different topics under a single umbrella of “visual modeling and presentation”.
Fig. 1. Interaction diagram: a human being and a robot (a digital being) each interact with the real world and with a virtual world, linked through visual scene modeling. (HCI: Human-Computer Interaction; VR: Virtual/Virtualized Reality; AI: Artificial Intelligence / Visual Navigation; AR: Augmented Reality)
First, let us look at the two topics, robot navigation and virtualized reality, in a broader
perspective of interaction between "being" and "environment" (Fig. 1). We can find a very close
resemblance between them: robot navigation is the interaction between a robot (i.e. a digital being)
and the real 3D world, while virtual/virtualized reality is the interaction between a person (i.e. a
human being) and a virtual/virtualized environment. If we limit our discussion of the "interaction" to
visual perception, the central problem that needs to be solved for these two kinds of interactions is
visual scene modeling and representation in a computer – either inside the “mind” of a robot or
outside the mind of a human being (Fig. 1).
Second, a closer examination of the research efforts of the past ten years shows that techniques and
representations for the two applications are surprisingly similar. Graphics people talk about multi-perspective projection for image-based rendering of large-scale scenes, while vision people try to use
the concept of spatio-temporal panoramic view images in robot localization and landmark
recognition. Vision/robotics people take advantage of the 360-degree view angle of omnidirectional
images for map building, road following, and obstacle detection in robot navigation, whereas graphics
people try to generate omnidirectional image representations for image-based rendering. My own
research also shows that we can use the exact same basic methodology for building and the same
structures for representing visual scene models for both robot navigation and image-based rendering.
Finally, we can find a class of interesting applications for integrating these two kinds of models: a
human-robot intelligent navigation (HRIN) system, such as a semi-autonomous mobile robot for
mail delivery, military surveillance and intelligent transportation. In an HRIN system, the robot will automatically carry out most of the basic tasks such as road following, obstacle detection, and target localization, while a human supervisor will make important decisions or deal with emergency
situations via augmented reality and tele-operation. Thus a unified model that includes both the
symbolic environment model for navigation and the photorealistic scene model for visualization is
required.
Needless to say, visual navigation of a mobile robot in a natural environment has always been a
very interesting but challenging problem. It involves almost every aspect of computer vision research
- from visual sensors through robust algorithms to visual representations. The basic requirements of
visual navigation include global localization (to decide where to go), road following (to stay on the
road) and obstacle detection (to avoid collision). Only after these safety requirements have been
satisfied, which has proven to be a non-trivial problem, can the robot pursue other task-oriented
goals. It is clear that visual environment modeling is the foundation of these basic issues in visual
navigation - and it may extend to most of the real world problems in computer vision. This work
presents a systematic approach to visual modeling of a natural scene for robot navigation:
Fig. 2. Full view vision for robot navigation: spatio-temporal panoramic views of the roadside scenes from the left and right side views, binocular left and right front views of the road surface, and an omnidirectional view.
1. A purposive, multi-scale and full-view visual scene modeling approach is proposed for visual
navigation in a natural environment (Chapter 1 - Chapter 5). As a typical instance, an integrated
system POST is proposed which combines three novel modules (Fig. 2): Panoramic vision for
landmark recognition, Omnidirectional vision for road understanding and STereo vision for obstacle
detection. This approach tries to overcome the drawbacks of traditional visual navigation methods
that have mostly depended on local and/or single view visual information. However, the proposed
approach is not just a simple combination of the three novel sensors and methods, but rather a
systematic integration under the strategy of purposive vision (“the right way for the right work”), and
under the philosophy of a systems approach which emphasizes that “the whole is more than the sum of
its components”. Thus, correct sensor design, adequate levels of scene representation and
corresponding robust and fast algorithms are specifically explored for each given task while the
interconnection among the vision sub-systems is taken into consideration under the overall goal of
autonomous navigation. The human-robot cooperation in different navigation modes (autonomous,
semi-autonomous and tele-operational) and different levels of vision enhancements (video
enhancement, stereo enhancement, view enhancement, information enhancement and virtualized
reality) will be discussed.
Fig. 3. 3D panoramic representation for landmark selection: (1) panoramic texture map, (2) panoramic depth map, and (3) parallel projection of the 3D panorama. Annotated landmarks include a horizontal wedge and a row of flags, a building façade and steps, pine trees and bamboo, depth changes in the wall, a bridge and a pedestrian.
2. A two-stage method is presented for 3D panoramic scene modeling for landmark selection
(Chapter 2). As inputs, image sequences are captured by a video camera subject to small but
unpredictable fluctuation on a common road surface. First, a 3D image stabilization method is
proposed which eliminates fluctuation from the vehicle’s smooth motion so that "seamless"
panoramic view images (PVIs) and epipolar plane images (EPIs) can be generated. Second, an
efficient panoramic EPI analysis method is proposed to combine the advantages of both PVIs and EPIs in two important steps: frequency domain locus orientation detection, and spatio-temporal domain motion boundary localization. The two-stage method not only combines Zheng-Tsuji’s PVI method with Bolles-Baker’s EPI analysis, resulting in the so-called panoramic EPI
method, but also generalizes them to handle image sequences subject to small but unpredictable
camera fluctuations. Since camera calibration, image segmentation, feature extraction and matching
are completely avoided, all the proposed algorithms are fully automatic and rather general. Finally, a
compact representation in the form of a 3D panorama for a large-scale scene is constructed that can
be used effectively for generalized landmark selection for robot navigation (Fig. 3). This method will further be applied to image-based rendering.
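To make the depth-from-slope relation in the EPI analysis concrete, the following sketch estimates depth along an EPI from the local orientation of its traces. It is a minimal illustration only, assuming a stabilized sequence, lateral translation at a known constant speed and a known focal length in pixels; the windowed least-squares slope estimator stands in for the thesis's frequency-domain GFOD detector, and the function and parameter names are hypothetical.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def epi_depth_sketch(epi, focal_px, speed, frame_dt, win=7):
    """Depth from the slope of the linear traces in an epipolar plane image.

    epi      : 2-D array, rows = time (frames), columns = image column x
    focal_px : focal length in pixels (assumed known)
    speed    : assumed constant vehicle speed in metres per second
    frame_dt : time between frames in seconds

    Under lateral translation a point at depth Z moves with image velocity
    dx/dt = focal_px * speed / Z, so Z = focal_px * speed / (dx/dt).
    """
    epi = epi.astype(float)
    It, Ix = np.gradient(epi)                # temporal and spatial derivatives
    # windowed least-squares slope of the brightness-constancy constraint
    # Ix * v + It = 0, giving the trace velocity v in pixels per frame
    num = uniform_filter(Ix * It, size=win)
    den = uniform_filter(Ix * Ix, size=win) + 1e-9
    vel = -(num / den) / frame_dt            # pixels per second
    return focal_px * speed / np.clip(np.abs(vel), 1e-6, None)
```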
Fig. 4. ROVINN architecture and its interconnection with the other two modules: the omni-view sensor output is processed by the image processing module (polar transform, PCA and DFT); a road classification network determines the road class, which selects among road-class-specific road orientation networks; the estimated road class and orientation are fused with panoramic information (robot location) and binocular information (obstacles) in the data fusion module to produce steering commands. (RCN: Road Classification Network; RON: Road Orientation Network; DFM: Data Fusion Module; IPM: Image Processing Module; DFT: Discrete Fourier Transform; PCA: Principal Component Analysis)
3. A new road following approach, the Road Omni-View Image Neural Networks (ROVINN), has
been proposed (Chapter 3). It combines the omnidirectional image sensing technique with neural
networks in such a manner that the robot is able to learn recognition and steering knowledge from the omnidirectional road images, which in turn guarantees that the robot will never miss the road. The
ROVINN approach brings Yagi’s COPIS (conic omnidirectional projection image sensor) method to
outdoor road scenes and provides an alternative solution different from CMU’s ALVINN system.
Compact and rotation-invariant image features are extracted by integrating an omnidirectional
eigenspace representation with frequency analysis, using principal component analysis (PCA) and
Fourier transforms (DFTs). The modular neural networks of the ROVINN estimate road orientations
more robustly and efficiently by classifying the roads as a first step, which enables the robot to adapt
to various road types automatically.
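The rotation-invariant feature extraction can be illustrated with a short sketch: unwrap the omnidirectional image into polar coordinates, project each angular column onto a radial PCA basis learned offline, and keep DFT magnitudes along the angular axis, which do not change when the robot rotates, while the phase carries orientation information. This is only a schematic outline under those assumptions; the nearest-neighbour polar sampling, the function names and the number of harmonics are illustrative, not the thesis's exact formulation.

```python
import numpy as np

def polar_unwrap(omni_img, center, r_min, r_max, n_r=32, n_theta=180):
    """Resample an omnidirectional image onto a (radius x angle) polar grid."""
    cy, cx = center
    rs = np.linspace(r_min, r_max, n_r)
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    ys = np.rint(cy + rs[:, None] * np.sin(thetas)[None, :]).astype(int)
    xs = np.rint(cx + rs[:, None] * np.cos(thetas)[None, :]).astype(int)
    return omni_img[ys, xs]                    # shape (n_r, n_theta)

def rotation_invariant_features(polar, radial_basis, n_harmonics=8):
    """Project each angular column onto a radial PCA basis, then keep DFT
    magnitudes along the angle axis.

    A rotation of the robot only shifts the polar image along the angle
    axis, so the DFT magnitudes are rotation invariant, while the phase of
    the first harmonic can serve as a coarse orientation estimate.
    """
    coeffs = radial_basis @ polar              # (n_components, n_theta)
    spectrum = np.fft.rfft(coeffs, axis=1)     # DFT along the angle axis
    features = np.abs(spectrum[:, :n_harmonics]).ravel()
    orientation = np.angle(spectrum[:, 1])     # per-component phase
    return features, orientation
```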
Fig. 5. Image gaze transformation and obstacle detection (obstacles: trees and a person). Top: left and right views in a single camera image; bottom-left: rectified left image after gaze transformation; bottom-right: obstacle region after zero-disparity gaze control. The difference image shows that the ground images have been registered.
4. A novel method called the Image Gaze Transformation is presented for stereo-vision-based road
obstacle detection (Chapter 4). Obstacle detection is modeled as a reflexive behavior of detecting
anything that is different from a planar road surface. Dynamic gaze transformation algorithms are
developed so that the algorithms can work on a rough road surface. The novelty of the (dynamic)
gaze transformation method, which resembles gaze control in human vision, lies in the fact that it
brings the road surface to zero disparity so that the feature extraction and matching procedures of
traditional stereo vision methods are completely avoided in the proposed obstacle detection
algorithms. The progressive processing strategy from yes/no verification, through focus of attention,
to 3D measurement based on the reprojection transformation makes the hierarchical obstacle detection
techniques efficient, fast and robust.
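The zero-disparity idea can be sketched as a single warp-and-subtract step: a ground-plane homography registers the road surface between the two views, so anything standing above the road leaves a residual in the difference image. The sketch below is a minimal illustration assuming grayscale images and a homography H_ground obtained beforehand (e.g., from road-surface correspondences); it is not the thesis's exact algorithm, which adds statistical modeling and a progressive verification strategy.

```python
import cv2
import numpy as np

def detect_obstacles(left_img, right_img, H_ground, diff_thresh=25):
    """Zero-disparity obstacle detection by a ground-plane gaze transformation.

    H_ground maps road-surface pixels of the left view onto the right view.
    After warping, the two grayscale images agree on the road surface, so a
    large residual marks a violation of the planar-road assumption.
    """
    h, w = right_img.shape[:2]
    # warp the left view so that the road surface registers with the right view
    left_warped = cv2.warpPerspective(left_img, H_ground, (w, h))
    diff = cv2.absdiff(left_warped, right_img)
    mask = (diff > diff_thresh).astype(np.uint8) * 255
    mask = cv2.medianBlur(mask, 5)   # suppress isolated responses
    return mask, diff
```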
To validate the proposed strategies and methods, we have implemented the following algorithms and
systems.
(1) Design of novel sensors. An omni-view image sensor is designed and realized (Fig. 6), and its
properties for outdoor road understanding are thoroughly studied. A patented single camera binocular
vision system with full horizontal field of view is also designed and constructed where left and right
views are projected to the top and bottom halves of a single image (Fig. 7; a view-splitting sketch is given after this list). It has been applied to obstacle detection on real roads. An inexpensive and integrated full view smart sensor POST
(Panoramic, Omnidirectional and STereo vision sensor) is proposed, which integrates a 360-degree
omnidirectional view with a binocular forward view as well as both left and right side views by using
a single camera and a set of reflection mirrors (Fig. 8).
(2) Real scene experiments. Experimental results of training and testing the ROVINN using real
road images have shown that the proposed method for road following is quite promising. A real-time
visual obstacle detection system has been set up and extensively tested on outdoor road scenes.
(3) 3D Scene modeling system. In the 3D panoramic scene modeling system (Fig. 9), the
algorithms for motion filtering and image stabilization, kinetic occlusion detection and depth layering
have been developed, and 3D layered panoramic models have been constructed for many image
sequences. These efforts form the basic framework for both global localization using generalized
landmark selection, and the synthesis of photo-realistic image-based renderings.
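As a small illustration of the single-camera design mentioned in (1), the sketch below splits one composite frame into its two views, assuming the Fig. 7 partition with the left view on the top half and the right view on the bottom half; in the real system the split row and any mirror-induced flip would come from calibration of the rig.

```python
def split_single_camera_stereo(frame):
    """Split one frame from the single-camera stereo rig (Fig. 7) into views.

    Assumes the mirrors place the left view on the top half of the frame and
    the right view on the bottom half (the split row used here is a
    simplification; it would be determined by calibrating the mirror geometry).
    """
    h = frame.shape[0] // 2
    left_view = frame[:h]          # top half  -> left view
    right_view = frame[h:2 * h]    # bottom half -> right view
    return left_view, right_view
```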
Fig. 6. Omni-view image (OVI) sensor: (1) a prototype of the OVI sensor; (2) an omnidirectional image.
Fig. 7. Single camera stereo vision system: left and right views are projected to the top and bottom halves of a single image. (1) System geometry (top view): L1, R1, L2 and R2 are mirrors in front of the real camera. (2) A real binocular image pair.
Fig. 8. Integrated full view vision sensor: POST. (1) Sensor geometry: L1, R1, L2, R2, MR, ML and P are planar mirrors and C is a conic mirror; PR and PL are the virtual left and right side view "cameras", SL and SR are the two virtual binocular front view "cameras", and O is the virtual omni-view "camera" looking at the conic mirror; the real camera (shown as an illustrative sensor target) points perpendicularly into the page. (2) A composite 640x480 image containing the binocular front views (2 x 256x128), the omnidirectional view (60 pixels x 360°), the left and right side views (80x480 each), and a spatio-temporal (PVI) image.
Fig. 9. System diagram of 3D panoramic scene modeling: video stabilization, PVI and EPI generation, depth map acquisition (orientation and motion boundary detection on EPI 1 through EPI H, where H is the height of a frame), fusion of the panoramic texture map and panoramic depth map, landmark selection into a panoramic landmark model for visual navigation, and occlusion recovery with depth layering into relief-like and image-based LAMPs for image-based rendering. (PVI: Panoramic View Image; EPI: Epipolar Plane Image; LAMP: Layered, Adaptive-resolution and Multi-perspective Panorama)