Active Perception - Electrical Engineering & Computer Sciences

Active Perception
We not only see but we look; we
not only touch but we feel.
Active Perception vs. Active
In the robotics and computer vision literature, the term
“active sensor” generally refers to a sensor that transmits
energy (generally electromagnetic radiation, e.g., radar,
microwaves and collimated light, or acoustic energy, e.g., sonar
and ultrasound) into the environment
and receives and measures the reflected signals.
We believe that the use of active sensors is not a necessary
condition on active sensing, and that sensing can be performed
with passive sensors (that only receive, and do not
emit, information), employed actively.
Active Sensing
• Hence the problem of Active Sensing can be stated as a
problem of controlling strategies applied to the data
acquisition process, which will depend on the current state
of the data interpretation and the goal or the task of the
process.
• The question may be asked, “Is Active Sensing only an
application of Control Theory?” Our answer is: “No, at
least not in its simple version.” Here is why:
Active Perception
• 1) The feedback is performed not only on sensory data
but on complex processed sensory data, i.e.,
extracted features, including relational features.
• 2) The feedback is dependent on a priori knowledge and
models that are a mixture of numeric/parametric and
symbolic information.
Active Perception turned into an engineering agenda
The implications of the active sensing/perception approach are the
following:
1) The necessity of models of sensors. This is to say, first,
the model of the physics of sensors as well as the noise of
the sensors. Second, the model of the signal processing and data
reduction mechanisms that are applied to the measured
data. These processes produce parameters with a definite
range of expected values plus some measure of uncertainty.
These models shall be called Local Models.
Engineering agenda, cont.
2) The system (which mirrors the theory) is modular as
dictated by good computer science practices and interactive,
that is, it acquires data as needed. In order to be able
to make predictions on the whole outcome, we need, in
addition to models of each module (as described in 1)
above), models for the whole process, including feedback.
We shall refer to these as Global Models.
3) Explicit specification of the initial and final state/goal.
If the Active Vision theory is a theory, what is its predictive
power? There are two components to our theory, each
with certain predictions:
Active Vision theory
1) Local models. At each processing level, local models
are characterized by certain internal parameters. One example
of a local model is a region-growing algorithm whose internal
parameters are the local similarity and the size of the local
neighborhood. Another example is an edge detection algorithm
whose parameter is the width of the band-pass filter in
which one is detecting the edge effect. These parameters
predict a) the definite range of plausible values, and b) the
noise and uncertainty, which will determine the expected
resolution, sensitivity, and robustness of the output results from
each module.
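To make the idea of a local model with internal parameters concrete, here is a minimal sketch of region growing in Python. It is an illustration only, not the algorithm used in the original work: the function name `region_grow`, the running-mean similarity test, and the square neighborhood of half-width `radius` are all assumptions chosen for brevity. The two arguments `similarity` and `radius` play the role of the internal parameters named above (local similarity and neighborhood size).

```python
from collections import deque


def region_grow(image, seed, similarity, radius=1):
    """Grow a region from `seed` over a 2D list of gray values.

    similarity: largest allowed |pixel - current region mean| (assumed criterion).
    radius: half-width of the square neighborhood searched around each pixel.
    Returns the set of (row, col) positions in the region.
    """
    rows, cols = len(image), len(image[0])
    region = {seed}
    frontier = deque([seed])
    total = image[seed[0]][seed[1]]  # running sum of gray values in the region
    while frontier:
        r, c = frontier.popleft()
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                nr, nc = r + dr, c + dc
                if (nr, nc) in region or not (0 <= nr < rows and 0 <= nc < cols):
                    continue
                mean = total / len(region)
                if abs(image[nr][nc] - mean) <= similarity:
                    region.add((nr, nc))
                    total += image[nr][nc]
                    frontier.append((nr, nc))
    return region


# Hypothetical 3x4 test image: a dark block on the left, a bright block on the
# right, and one outlier in the bottom-right corner.
image = [
    [10, 11, 50, 52],
    [12, 10, 51, 49],
    [11, 12, 50, 90],
]
dark = region_grow(image, seed=(0, 0), similarity=3)
bright = region_grow(image, seed=(0, 2), similarity=3)
```

Changing `similarity` or `radius` changes which pixels are merged, which is the sense in which the internal parameters determine the range and robustness of a module's output.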
Active Vision, cont.
2) Global models characterize the overall performance
and make predictions on how the individual modules will
interact, which in turn will determine how intermediate
results are combined. The global models also embody the
global external parameters and the initial and final global state
of the system. The basic assumption of the Active Vision
approach is the inclusion of feedback into the system and
gathering data as needed. The global model represents all
the explicit feedback connections, parameters, and the optimization
criteria which guide the process.
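The notion of feedback with data gathered as needed can be sketched as a loop that keeps acquiring measurements until an explicit stopping criterion is met. The Python below is a toy illustration under stated assumptions: the function name `acquire_until_confident`, the use of the standard error of the mean as the uncertainty measure, and the simulated noisy sensor are all hypothetical and are not taken from the original system.

```python
import math
import random
import statistics


def acquire_until_confident(measure, target_std_error, min_samples=2, max_samples=1000):
    """Call `measure()` repeatedly until the estimate is certain enough.

    The stopping criterion (assumed here) is that the standard error of the
    mean drops to `target_std_error` or below. Returns (estimate, samples used).
    """
    samples = []
    while len(samples) < max_samples:
        samples.append(measure())
        if len(samples) >= min_samples:
            std_error = statistics.stdev(samples) / math.sqrt(len(samples))
            if std_error <= target_std_error:
                break  # goal state reached: stop acquiring data
    return statistics.mean(samples), len(samples)


# Hypothetical noisy range sensor: true value 2.0 with Gaussian noise.
rng = random.Random(0)
estimate, used = acquire_until_confident(
    lambda: rng.gauss(2.0, 0.1), target_std_error=0.01, min_samples=10
)
```

The loop makes the three ingredients of the global model explicit: the feedback signal (the current uncertainty), the goal state (the target uncertainty), and the limit on resources (`max_samples`).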
Control Strategies
three distinct control stages proceeding in sequence:
initialization,
processing in midterm,
completion of the task.
Strategies are divided with respect to the tradeoff between
how much data measurement the system acquires (data
driven, bottom-up) and how much a priori or acquired
knowledge the system uses at a given stage (knowledge
driven, top-down). Of course, there is also the strategy that
combines the two.
Bottom up and Top down process
• To eliminate possible ambiguities with the terms bottom-up
and top-down, we define them here. Bottom-up (data
driven), in this discussion, is defined as a control
strategy where no concrete semantic, context-dependent
model is available, as opposed to the top-down strategy,
where such knowledge is available.
• Different tasks will determine the design of
the system, i.e. the architecture.
• Consider the following tasks:
• Manipulation
• Mobility
• Communication and Interaction of
machine to machine or people to people
via digital media or people to machine.
• Geographically distributed communication and
interaction using multimedia (vision primarily)
using the Internet.
• We are primarily concerned with unspoken
communication: gestures and body motion.
• Examples are: coordinated movement such as
dance, physical exercises, training of manual
skills, remote guidance of physical activities.
• Recognition and learning will play a role in all
the tasks.
• Serves as a constraint in the design.
• We shall consider only the constraints relevant
to the visual task that serves to accomplish the
physical activity.
• For example: in the manipulation task, the size
of the object will determine the data acquisition
strategy but also the design of the vision system
(choice of field of view, focal length, illumination,
and spatial resolution). Think of moving furniture
vs. picking up a coin.
• Another example: Mobility
• There is a difference if the mobility is on the
ground, in the air looking down or up.
• The position and orientation of the observer will
determine the interpretation of the signal.
• Furthermore there is a difference between
outdoor and indoor environment.
• Varied visibility conditions will influence the
design and the architecture.
• For distributed communication and interaction:
• The environment will depend on the
application. It could be a digitized environment
of the place where the participants are, or it
could be a virtual environment; for
example, one can put people into a
historical environment (e.g., Rome or Pompeii).
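The manipulation example above (moving furniture vs. picking up a coin) can be made quantitative with the pinhole camera model, in which the focal length needed to cover a field of a given width scales with distance and inversely with field width. The sketch below is only an illustration: the function names and all numeric values (sensor width, pixel count, distances, field widths) are hypothetical and chosen to show the contrast between the two tasks.

```python
def required_focal_length_mm(sensor_width_mm, distance_mm, field_width_mm):
    """Pinhole model: focal length that maps `field_width_mm` at
    `distance_mm` onto the full sensor width."""
    return sensor_width_mm * distance_mm / field_width_mm


def footprint_mm_per_pixel(field_width_mm, pixels_across):
    """Spatial resolution on the object: scene width covered by one pixel."""
    return field_width_mm / pixels_across


# Assumed camera: 6.4 mm wide sensor, 640 pixels across.
SENSOR_MM, PIXELS = 6.4, 640

# Coin: a 50 mm wide field viewed from 0.5 m.
coin_f = required_focal_length_mm(SENSOR_MM, 500.0, 50.0)
coin_res = footprint_mm_per_pixel(50.0, PIXELS)

# Furniture: a 2 m wide field viewed from 3 m.
sofa_f = required_focal_length_mm(SENSOR_MM, 3000.0, 2000.0)
sofa_res = footprint_mm_per_pixel(2000.0, PIXELS)
```

Under these assumed numbers the coin task calls for a long focal length (about 64 mm) and resolves well under a millimeter per pixel, while the furniture task calls for a short focal length (about 9.6 mm) at roughly 3 mm per pixel, which is why object size constrains both the optics and the acquisition strategy.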
Active Vision System for 3D object recognition
Table 1 below outlines the multilayered system of an
active vision system, with the final goal of 3-D object/shape
recognition. The layers are enumerated from 0, 1, 2, ...
with respect to the goal (intermediate results) and feedback
parameters. Note that the first three levels correspond to
monocular processing only. Naturally, the menu of extracted
features from monocular images is far from exhaustive. The
other levels, 3-5, are based on binocular images. It is only
the last level that is concerned with semantic interpretation.
Table 1. Levels, feedback parameters, and goal/stopping conditions

0. Control of the physical device
   Feedback parameters: directly measured: current lighting system, open/close aperture
   Goal/stopping conditions: grossly focused scene, camera adjusted

1. Control of the physical device
   Feedback parameters: directly measured: focus, zoom; computed: contrast, distance from ...
   Goal/stopping conditions: focused on one object

2. Control of low-level vision
   Feedback parameters: computed only: threshold of the width of filters
   Goal/stopping conditions: 2D segmentation, max. # of edges/regions

3. Control of binocular system hardware
   Feedback parameters: directly measured: vergence angle; computed: range of admissible depth values
   Goal/stopping conditions: depth map

4. Control of intermediate geometric vision
   Feedback parameters: computed only: threshold of similarity between surfaces

5. Control of several views, integration process
   Feedback parameters: compute the position, rotation of different views
   Goal/stopping conditions: 3D object description

6. Control of semantic interpretation
   Goal/stopping conditions: recognition of 3D objects/scene
Several comments are in order:
1) Although we have presented the levels in a sequential
order, we do not believe that this is the only way
information can flow through the system. The only
significance in the order of levels is that the lower levels
are somewhat more basic and necessary for the higher
levels to function.
2) In fact, the choice of the level at which one accesses the
system very much depends on the given task and/or
the goal.
Active Visual Observer
• Several groups around the world have built
binocular active vision systems that can
attend to and fixate a moving target.
• We will review two such systems: one built
at the GRASP Laboratory at UPenn and the
other at KTH (Royal Institute of
Technology) in Stockholm, Sweden.
The UPENN System
A Binocular Active Vision System
• PennEyes is a head-in-hand system with
a binocular camera platform mounted on a
6 DOF robotic arm. Although physically
limited to the reach of the arm, the
functionality of the head is extended
through the use of motorized optics
(10x zoom). The architecture is configured
to rely minimally on external systems.
Design considerations
• Mechanical: The precision positioning was
afforded by the PUMA arm. However, the
binocular camera platform needed to weigh in
the range of 2.5 kg.
• Optics: The use of motorized lenses (zoom,
focus and aperture) offered an increase
• Electronics: This was the most critical element in
the design. A MIMD DSP organization was
decided as the best tradeoff between
performance, extensibility and ease of
Puma Polka
Tracking Performance
• The two robots afforded objective
measures of tracking performance with
a precision target.
• A three-dimensional path with known
precision can be repeatedly generated,
allowing the comparison of different visual
servoing algorithms.
BiSight Head
• Has independent pan axes with a peak
tracking performance of 1000 deg/s and
12,000 deg/s². The concern here is how
well the calibration can be maintained after
repeated exposure to acceleration and vibration.
• Another problem occurred with zoom adjustment:
the focal length also changed.
• The binocular camera platform has 4 optical
(zoom and focus) and 2 mechanical (pan)
degrees of freedom.
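To give a feel for the pan-axis figures quoted above, the following sketch works out how long the axis would take to reach its peak velocity and how far it would turn while doing so. It assumes constant acceleration from rest, which is a simplification; the function name `ramp_to_peak` is hypothetical, and only the two input values (1000 deg/s and 12,000 deg/s²) come from the text.

```python
def ramp_to_peak(peak_velocity_deg_s, acceleration_deg_s2):
    """Return (time in s, angle swept in deg) needed to reach peak velocity
    from rest under constant acceleration (assumed motion profile)."""
    time_s = peak_velocity_deg_s / acceleration_deg_s2
    angle_deg = peak_velocity_deg_s ** 2 / (2.0 * acceleration_deg_s2)
    return time_s, angle_deg


time_s, angle_deg = ramp_to_peak(1000.0, 12000.0)
print(f"{time_s * 1000:.1f} ms, {angle_deg:.1f} deg")  # prints "83.3 ms, 41.7 deg"
```

Under this assumption the axis reaches full speed in well under a tenth of a second, which helps explain why calibration under repeated acceleration is a concern.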
C40 Architecture
• Beyond the basic computing power of the
individual C40s, the performance of the
network is enhanced by the ability to
interconnect the modules with a fair
degree of flexibility as well as the ability
to store an appreciable amount of
information. The former is made possible
by up to six comports on each module and
the latter by several Mbytes of local
memory.
Critical Issues
• The performance of any modularly
structured active vision system depends
critically on a few recurring issues. They
involve the coordination of processes
running on different subsystems, the
management of large data streams,
processing and transmission delays and
the control of systems operating at
different rates.
• The three major components of this modular
active vision system are independent entities
that work at their own pace. The lack of a
common time base makes synchronizing the
components a difficult task.
• In some cases, an external signal can be used
to synchronize independent hardware
components. In this system, the C40 network, the
digitizers and the graphics module are slaved to
the vertical sync of the genlocked cameras.
Other considerations
• Bandwidth – large data streams
• System Integration. If data throughput becomes
the bottleneck, then some new data
compression algorithms must be invoked.
• Latency. Delays between the acquisition of a
frame and the motor response to it are an
inevitable problem of active vision systems.
Delays make the control more difficult because
they can cause instabilities.
• Multi-rate control. Active vision systems
suggest by their very nature a hierarchical
approach to control.
• If the visual and mechanical control rates
are one or more orders of magnitude
apart, the mechanical control loops are
essentially independent of the visual
control loop.
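The remark above that latency can cause instabilities can be illustrated with a toy simulation. The sketch below is not a model of any of the systems described here: it assumes a single axis that integrates a proportional correction once per frame, with the position measurement arriving `delay_frames` frames late. The function name `simulate_tracking`, the gain of 0.8, and the delays are hypothetical values picked to show the effect.

```python
def simulate_tracking(gain, delay_frames, steps=100, target=1.0):
    """Proportional tracking of a fixed target with a delayed measurement.

    Each frame the axis moves by gain * (target - position observed
    delay_frames frames ago). Returns the tracking error after each frame.
    """
    # Pre-fill the history so that delayed reads are defined from the start.
    positions = [0.0] * (delay_frames + 1)
    errors = []
    for _ in range(steps):
        observed = positions[-1 - delay_frames]  # stale measurement
        command = gain * (target - observed)
        positions.append(positions[-1] + command)
        errors.append(target - positions[-1])
    return errors


no_delay = simulate_tracking(gain=0.8, delay_frames=0)
with_delay = simulate_tracking(gain=0.8, delay_frames=3)

# Largest error magnitude over the last 20 frames of each run.
settled = max(abs(e) for e in no_delay[-20:])
diverged = max(abs(e) for e in with_delay[-20:])
```

With no delay the error shrinks by a constant factor every frame, while the same gain with a three-frame delay produces a growing oscillation. In this toy model, lowering the gain restores stability at the cost of slower tracking, which is one reason a slow visual loop is often layered over a faster mechanical loop.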