
Perceptive Context for Pervasive
Computing
Trevor Darrell
Vision Interface Group
MIT AI Lab
Human-centered Interfaces
• Free users from desktop and wired interfaces
• Allow natural gesture and speech commands
• Give computers awareness of users
• Work in open and noisy environments
- Outdoors -- H21 next to construction site!
- Indoors -- crowded meeting room (E21)
• Vision’s role: provide perceptive context
Perceptive Context
• Who is there? (presence, identity)
• What is going on? (activity)
• Where are they? (individual location)
• Which person said that? (audiovisual grouping)
• What are they looking / pointing at? (pose, gaze)
Today:
• Tracking speakers with an audio-visual microphone array
• Tracking faces for gaze aware dialog interfaces
• Speaker verification with higher-order joint audio-visual
statistics
Tracking speakers
• Track location and short-term identity
• Should work with lots of fast lighting variation
- Stereo based methods
- New technique for dense background modeling
• Estimate trajectory of 3D foreground points from
multiple views over time
• Guide active cameras and microphone array
• Recognize activity and participant roles
Range-based stereo person tracking
• Range can be insensitive to fast illumination change
• Compare range values to known background
• Project into 2D overhead view
[Figure: intensity image, range image, foreground segmentation, and plan view.]
• Merge data from multiple stereo cameras…
• Group into trajectories…
• Examine height for sitting/standing…
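As a rough illustration of the two steps above (comparing range to a background model, then projecting into an overhead view), here is a minimal NumPy sketch; the thresholds, sentinel values, and grid parameters are assumptions, not the values used in the actual tracker.

```python
# Minimal sketch of range-based foreground detection and plan-view projection.
import numpy as np

def range_foreground(range_img, bg_range, thresh=0.15):
    """Mark pixels whose range differs from the background model by more than
    `thresh` metres; pixels with no valid range (<= 0) are left as background."""
    valid = (range_img > 0) & (bg_range > 0)
    return valid & (np.abs(range_img - bg_range) > thresh)

def plan_view(points_xyz, cell=0.05, extent=5.0):
    """Project 3-D foreground points (N x 3, metres) into a 2-D overhead
    occupancy grid covering [-extent, extent] in x and y."""
    n = int(2 * extent / cell)
    grid = np.zeros((n, n), dtype=np.int32)
    ix = ((points_xyz[:, 0] + extent) / cell).astype(int)
    iy = ((points_xyz[:, 1] + extent) / cell).astype(int)
    ok = (ix >= 0) & (ix < n) & (iy >= 0) & (iy < n)
    np.add.at(grid, (iy[ok], ix[ok]), 1)
    return grid
```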
Fast/dense stereo foreground
Standard stereo searches exhaustively per frame:
[Figure: left (reference) and right images, with the per-pixel disparity search between them.]
But if the background is predictable, we can prune
most of the search!
Fast/dense stereo foreground
[Figure: background and new-image intensity and range, and the extracted foreground depth.]
Sparse Stereo Model
[Figure: scene and its sparse range image.]
What should we do when the background model has an undefined range value but the
new image has a valid range value at that pixel?
• conservative -- call it background
• liberal -- call it foreground
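A minimal sketch of the two policies, assuming (as above) that a range value of zero marks pixels where stereo matching failed; the function and parameter names are illustrative.

```python
# Conservative vs. liberal handling of pixels where the background model has
# no valid range but the new frame does.
import numpy as np

def segment(range_img, bg_range, thresh=0.15, policy="conservative"):
    new_valid = range_img > 0
    bg_valid = bg_range > 0
    fg = new_valid & bg_valid & (np.abs(range_img - bg_range) > thresh)
    undefined_bg = new_valid & ~bg_valid     # background range never measured
    if policy == "liberal":
        fg |= undefined_bg   # call such pixels foreground (risks false positives)
    # "conservative" leaves them as background (risks missing much of the person)
    return fg
```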
Conservative Segmentation
[Figure: background model, a lighting change, and a new person, with the resulting foreground masks.]
Type I errors! (Misses most of the person.)
Liberal Segmentation
[Figure: background model, a new person, and a lighting change, with the resulting foreground masks.]
Type II errors! (False positives.)
Dense Stereo Model Acquisition
Different gain settings yield different regions of undefined
range values
Dense Stereo Model Acquisition
Combine valid measurements from observations at different gain and/or
illumination settings:
[Figure: range images acquired at several gain settings are combined into a dense background model.]
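A small sketch of this merging step, assuming each observation is an H×W range map with zeros where matching failed; averaging overlapping valid values is my own choice, not necessarily what the original system does.

```python
# Merge range maps taken at different gain / illumination settings into one
# dense background model: keep any pixel that was valid in at least one map.
import numpy as np

def merge_backgrounds(range_maps):
    """range_maps: list of H x W arrays, 0 where stereo matching failed."""
    stack = np.stack(range_maps)                      # K x H x W
    valid = stack > 0
    counts = valid.sum(axis=0)
    summed = np.where(valid, stack, 0.0).sum(axis=0)
    dense = np.where(counts > 0, summed / np.maximum(counts, 1), 0.0)
    return dense, counts > 0                          # merged range + validity mask
```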
State of the Art (cont’d)
If you want really dense range backgrounds from a single stereo view…
Visibility Constraints
[Figure: two calibrated cameras C_1 and C_2 observe the background surface B; a pixel p = (p_x, p_y) with disparity I_D1(p_x, p_y) in camera 1 predicts a pixel and disparity b in camera 2.]

p^1 = (p_x, p_y, I_D1(p_x, p_y))          (pixel and its disparity in camera C_1)
P = T_E1 p^1 = (P_X, P_Y, P_Z, W)^T       (back-projection to a world point)
b = T_D2 T_E1 p^1 = T_D2 P                (predicted pixel and disparity in camera C_2)
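A hedged sketch of this reprojection step, treating T_E1 and T_D2 above as homogeneous 4×4 transforms; the exact calibration conventions of the original system are an assumption here.

```python
# Back-project a background pixel (with its disparity) from camera 1 to a world
# point, then re-project it into camera 2 to predict the background disparity
# the second view should see there.
import numpy as np

def predict_background_disparity(px, py, d1, T_E1, T_D2):
    """px, py: pixel in camera 1; d1: its background disparity.
    T_E1: 4x4 map from (px, py, d1, 1) to homogeneous world coordinates.
    T_D2: 4x4 map from homogeneous world coordinates to (x2, y2, d2, 1)."""
    p1 = np.array([px, py, d1, 1.0])
    P = T_E1 @ p1                   # world point (P_X, P_Y, P_Z, W)
    b = T_D2 @ P                    # predicted pixel + disparity in camera 2
    return b[:3] / b[3]             # (x2, y2, predicted background disparity)
```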
Visibility Constraints for Virtual
Backgrounds
[Figure: cameras C_1 and C_2 with their disparity images I_D1 and I_D2, and a pixel p mapped between the two views.]
Simple background subtraction
Virtual Background Segmentation
Range-based stereo person tracking
• Range can be insensitive to fast illumination change
• Compare range values to known background
• Project into 2D overhead view
[Figure: intensity image, range image, foreground segmentation, and plan view.]
• Merge data from multiple stereo cameras…
• Group into trajectories…
• Examine height for sitting/standing…
Multiple stereo views
Merged Plan-view segmentation
Points -> trajectories -> active sensing
[Diagram: spatiotemporal points are grouped into trajectories, which drive active camera motion, microphone-array steering, and activity classification.]
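An illustrative (not the original) sketch of the point-to-trajectory grouping step, using simple nearest-neighbour data association in the plan view; the distance gate is an assumption.

```python
# Group per-frame plan-view detections into trajectories: a detection extends
# the closest live trajectory if it is within max_jump metres, otherwise it
# starts a new trajectory.
import numpy as np

def update_trajectories(trajectories, detections, max_jump=0.5):
    """trajectories: list of lists of (x, y); detections: iterable of (x, y)."""
    unused = list(range(len(detections)))
    for traj in trajectories:
        if not unused:
            break
        last = np.array(traj[-1])
        dists = [np.linalg.norm(np.array(detections[i]) - last) for i in unused]
        j = int(np.argmin(dists))
        if dists[j] < max_jump:
            traj.append(tuple(detections[unused[j]]))
            del unused[j]
    for i in unused:
        trajectories.append([tuple(detections[i])])
    return trajectories
```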
Test Environment
Active camera tracking
Audio input in noisy environments
• Acquire high-quality audio from untethered, moving
speakers
• “Virtual” headset microphones for all users
Solutions
• Wireless close-talking microphone
• Shotgun microphone
• Microphone array
• Our solution: large, vision- and audio-guided microphone
array
Our approach
• Large-array, non-linear geometry
- allows selection of 3-D volume of space
- can select based on distance, not just direction (more than plain beamforming)
• Integrated with vision tracking
- makes real-time localization of multiple sources feasible
- known array geometry and target location ==> simple system
- precalibrate array with known source tone
• Related Work
- small-aperture vision-guided microphone arrays (Waibel)
- large-aperture audio-guided arrays (Silverman)
Microphone Arrays
• Microphones at known locations synchronized in time
• Electronically focused directional receiver
Array focusing
• Delay-and-sum beamforming – compensate for
propagation delays to reinforce target signal:
y(t, r_target) = Σ_{i=1}^{N} a_i x_i(t + d_i(r_target)),    d_i(r) = ‖r − r_i‖ / v_s

r_target : target position
r_i : position of the i-th microphone
v_s : speed of sound
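A minimal delay-and-sum sketch for the formula above, using integer-sample delays and uniform weights; the variable names and sample-rate handling are illustrative, not the group's implementation.

```python
# Delay-and-sum beamforming with integer-sample delay compensation.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(signals, mic_positions, target, fs, weights=None):
    """signals: N x T array of synchronized microphone signals.
    mic_positions: N x 3 microphone coordinates (metres).
    target: 3-vector focus point. fs: sample rate (Hz)."""
    n_mics, n_samples = signals.shape
    if weights is None:
        weights = np.ones(n_mics) / n_mics
    delays = np.linalg.norm(mic_positions - target, axis=1) / SPEED_OF_SOUND
    delays -= delays.min()                        # keep only relative delays
    shifts = np.round(delays * fs).astype(int)
    out = np.zeros(n_samples)
    for i in range(n_mics):
        # advance each channel so the target's wavefronts line up, then sum
        out[:n_samples - shifts[i]] += weights[i] * signals[i, shifts[i]:]
    return out
```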
Delay and sum array processing
• Calibrate using cross-correlation analysis with single
source presentation
• Compute delay and weight based on geometry of array
and target
- delay: time of flight
- weight: estimated SNR based on distance
• The filtered output is the delayed and weighted sum of all microphone signals.
Beamforming Example
[Figure: received signals, the delayed (time-aligned) signals, and the delayed-and-summed output signal.]
Array Size
• Beam width ∝ 1 / (array span)
• Large arrays select fully bounded volumes
• Small arrays select directional beams
Related Work
• Small-aperture vision-guided microphone arrays (Bub,
Hunke, and Waibel)
• Large-aperture audio-guided arrays (Silverman et al.)
Beamforming Demonstration
[Plots: array output power (dB) vs. position (meters).]
The first person moves on an oval path while counting; the second person is stationary while reciting the alphabet.
• Result from a single microphone at the center of the room:
• Result from the microphone array with its focus fixed at the initial position of the moving speaker:
Array Steering
• Audio-only – max-power search


r̂_target = arg max_r ∫ y(t, r)² dt
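An illustrative grid-search version of this max-power steering, reusing the delay_and_sum sketch above; the candidate grid itself is an assumption.

```python
# Audio-only steering by exhaustive max-power search over candidate positions.
import numpy as np

def max_power_position(signals, mic_positions, fs, candidates):
    """candidates: M x 3 array of candidate focus points (metres)."""
    best_pos, best_power = None, -np.inf
    for r in candidates:
        y = delay_and_sum(signals, mic_positions, r, fs)   # sketch defined earlier
        power = float(np.sum(y ** 2))                      # discretized ∫ y(t, r)^2 dt
        if power > best_power:
            best_pos, best_power = r, power
    return best_pos, best_power
```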
Audio-only steering is hard.
[Plots: array output power (dB) vs. position (meters).]
Hybrid Localization
• Vision-only steering isn’t perfect.
- Joint calibration
- Person tracking, not mouth tracking
• Can correct vision-based estimate with limited search
(implemented as gradient ascent) in audio domain
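A rough sketch of this correction step: start at the vision estimate and climb the array output power with a finite-difference gradient ascent. The step size, iteration count, and use of finite differences are all assumptions; it reuses the delay_and_sum sketch above.

```python
# Refine a vision-based position estimate by gradient ascent on array output power.
import numpy as np

def refine_with_audio(signals, mic_positions, fs, r_vision, step=0.05, iters=10):
    def power(r):
        y = delay_and_sum(signals, mic_positions, r, fs)
        return float(np.sum(y ** 2))

    r = np.asarray(r_vision, dtype=float)
    for _ in range(iters):
        grad = np.zeros(3)
        for k in range(3):                      # finite-difference gradient
            dr = np.zeros(3); dr[k] = step
            grad[k] = (power(r + dr) - power(r - dr)) / (2 * step)
        norm = np.linalg.norm(grad)
        if norm < 1e-9:
            break
        r = r + step * grad / norm              # small step uphill in power
    return r                                    # refined audio-visual estimate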
System flow (single target)
Video streams → vision-based tracker → r_vision
Audio streams + r_vision → gradient-ascent search in array output power → r_av
Delay-and-sum beamformer focused at r_av → y(t, r_av)
Results
Localization technique      SNR (dB)
Single microphone           -6.6
Video only                  -4.4
Audio-video hybrid           2.3

• Single microphone:
• Hybrid tracking with beamforming:
Results continued
Status
• Fully 3-D, multimodal sound source localization and separation system
• Real-time implementation of delay-and-sum array processing
• Future work:
- Compare to a commercial linear array
- More sophisticated beamforming (null steering)
- Connect to automated speech recognition (in progress)
- Incorporate single-channel source separation techniques
  - AVMI
  - ICA
  - source modeling
Today
• Tracking speakers with an audio-visual microphone array
• Tracking faces for gaze aware dialog interfaces
[John Fisher]
• Speaker verification with higher-order joint audio-visual statistics
Brightness and depth motion constraints
[Figure: intensity images I_t, I_{t+1} and depth images Z_t, Z_{t+1}; brightness and depth constancy between consecutive frames jointly constrain the motion.]
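A much-simplified sketch of combining the two constraints: it estimates only a 2-D image translation from the linearized brightness and depth constancy equations, whereas the actual closed-loop tracker solves for full 3-D pose. All function and variable names are illustrative.

```python
# Joint brightness + depth constancy: stack I_x u + I_y v + I_t = 0 and
# Z_x u + Z_y v + Z_t = 0 into one least-squares system for (u, v).
import numpy as np

def estimate_translation(I_t0, I_t1, Z_t0, Z_t1):
    def grads(a, b):
        gy, gx = np.gradient(a)
        return gx.ravel(), gy.ravel(), (b - a).ravel()
    Ix, Iy, It = grads(I_t0, I_t1)
    Zx, Zy, Zt = grads(Z_t0, Z_t1)
    A = np.stack([np.concatenate([Ix, Zx]), np.concatenate([Iy, Zy])], axis=1)
    b = -np.concatenate([It, Zt])
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```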
Closed-loop 3D tracker
Track the user's head gaze for hands-free pointing…
Head-driven cursor
Related Projects:
• Schiele
• Kjeldsen
• Toyama
Current application for second pointer or scrolling / focus of attention…
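For concreteness, a toy mapping from estimated head pose to a screen cursor; the gain, screen size, and yaw/pitch parameterization are assumptions rather than the group's actual calibration.

```python
# Map small head rotations around a neutral pose to clamped pixel coordinates.
def head_pose_to_cursor(yaw, pitch, screen_w=1280, screen_h=1024, gain=3.0):
    """yaw, pitch in radians relative to a neutral, screen-facing pose."""
    x = screen_w / 2 + gain * yaw * screen_w / 2
    y = screen_h / 2 - gain * pitch * screen_h / 2
    return (min(max(int(x), 0), screen_w - 1),
            min(max(int(y), 0), screen_h - 1))
```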
Head-driven cursor
[Table/video: pointing task performance compared across a single cursor, two hand cursors, and head-hand cursors.]
Gaze aware dialog interface
• Interface Agent responds to gaze of user
- agent should know when it’s being attended to
- turn-taking pragmatics
- anaphora / object reference
• Prototype
- E21 interface “sam”
- current experiments with face tracker on meeting room table
• Wizard-of-Oz (WOZ) initial user tests
• Integrating with wall cameras and hand gesture interface
Is that you talking?
New single-channel algorithm to reject stray utterances:
match the video to the audio!
- Audio-visual synchrony detection
- Analyze mutual information between the signals
Find a maximally informative subspace projection between audio and video…
[Fisher and Darrell]
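As a hedged illustration of scoring audio-visual synchrony with learned projections, the sketch below uses a closed-form canonical-correlation solution as a stand-in for the mutual-information-based projections of Fisher and Darrell; the feature choices and regularization are assumptions.

```python
# Score audio-visual synchrony as the top canonical correlation between
# projected audio and video feature streams (a proxy for mutual information).
import numpy as np

def synchrony_score(audio_feats, video_feats, reg=1e-3):
    """audio_feats: T x Da, video_feats: T x Dv, one row per frame."""
    A = audio_feats - audio_feats.mean(0)
    V = video_feats - video_feats.mean(0)
    Caa = A.T @ A / len(A) + reg * np.eye(A.shape[1])
    Cvv = V.T @ V / len(V) + reg * np.eye(V.shape[1])
    Cav = A.T @ V / len(A)
    M = np.linalg.solve(Caa, Cav) @ np.linalg.solve(Cvv, Cav.T)
    eigvals = np.linalg.eigvals(M)
    return float(np.sqrt(np.max(eigvals.real.clip(min=0.0))))
```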
Perceptive Context
Take-home message: vision provides perceptive context that makes applications
aware of users…
So far: detection, ID, head pose, audio enhancement, and synchrony verification… Soon:
• gaze -- add eye tracking to the pose-stabilized face
• pointing -- arm gestures for selection and navigation
• activity -- adapt outdoor activity classification [Grimson and Stauffer] to the indoor domain…