Perceptive Context for Pervasive Computing
Trevor Darrell, Vision Interface Group, MIT AI Lab

Human-centered Interfaces
• Free users from desktop and wired interfaces
• Allow natural gesture and speech commands
• Give computers awareness of users
• Work in open and noisy environments
  - Outdoors -- H21, next to a construction site!
  - Indoors -- crowded meeting room (E21)
• Vision's role: provide perceptive context

Perceptive Context
• Who is there? (presence, identity)
• What is going on? (activity)
• Where are they? (individual location)
• Which person said that? (audiovisual grouping)
• What are they looking / pointing at? (pose, gaze)

Today:
• Tracking speakers with an audio-visual microphone array
• Tracking faces for gaze-aware dialog interfaces
• Speaker verification with higher-order joint audio-visual statistics

Tracking speakers
• Track location and short-term identity
• Should be robust to fast lighting variation
  - stereo-based methods
  - new technique for dense background modeling
• Estimate trajectories of 3D foreground points from multiple views over time
• Guide active cameras and the microphone array
• Recognize activity and participant roles

Range-based stereo person tracking
• Range can be insensitive to fast illumination change
• Compare range values to a known background model
• Project foreground points into a 2D overhead (plan) view
  [figure: intensity, range, and foreground images with plan view]
• Merge data from multiple stereo cameras
• Group points into trajectories
• Examine height for sitting/standing

Fast/dense stereo foreground
Standard stereo searches exhaustively per frame over the left (reference) and right images -- but if the background is predictable, we can prune most of the search!
  [figure: background model (intensity, range), new image, foreground depth]

Sparse Stereo Model
  [figure: scene and its range image]
What to do when the background has an undefined range value but the new foreground image has a valid range value?
• Conservative -- call it background
• Liberal -- call it foreground

Conservative Segmentation Model
Type I errors! (misses most of the person)
  [figure: foreground masks under lighting change vs. a new person]

Liberal Segmentation Model
Type II errors! (false positives)
  [figure: foreground masks under lighting change vs. a new person]

Dense Stereo Model Acquisition
• Different gain settings yield different regions of undefined range values
• Combine valid measurements from observations at different gain and/or illumination settings

State of the Art (cont'd)
If you want really dense range backgrounds from one stereo view...

Visibility Constraints
  [figure: two-camera (C1, C2) epipolar geometry of visibility constraints; equation detail not recoverable]

Visibility Constraints for Virtual Backgrounds
  [figure: simple background subtraction vs. virtual background segmentation]

Range-based stereo person tracking (recap)
• Range can be insensitive to fast illumination change
• Compare range values to a known background model
• Project foreground points into a 2D overhead (plan) view
• Merge data from multiple stereo cameras
• Group points into trajectories
• Examine height for sitting/standing

Multiple stereo views
  [figure: plan-view segmentation merged from multiple stereo views]

Points -> trajectories -> active sensing
Spatiotemporal points drive active camera motion and the microphone array; trajectories drive activity classification.

Test Environment
  [figure: active camera tracking in the test environment]

Audio input in noisy environments
• Acquire high-quality audio from untethered, moving speakers
• "Virtual" headset microphones for all users

Solutions
• Wireless close-talking microphone
• Shotgun microphone
• Microphone array
• Our solution: a large, vision- and audio-guided microphone array

Our approach
• Large array, non-linear geometry
  - allows selection of a 3-D volume of space
  - can select based on distance (more than beamforming)
• Integrated with vision tracking
  - makes real-time localization of multiple sources feasible
  - known array geometry and target location ==> simple system
  - precalibrate the array with a known source tone
• Related work
  - small-aperture vision-guided microphone arrays (Waibel)
  - large-aperture audio-guided arrays (Silverman)

Microphone Arrays
• Microphones at known locations, synchronized in time
• Electronically focused directional receiver

Array focusing
• Delay-and-sum beamforming -- compensate for propagation delays to reinforce the target signal:

    y(t, r_target) = Σ_{i=1..N} a_i · x_i(t − d_i(r_target)),   where d_i(r) = ||r − r_i|| / v_s

  r_target: target position; r_i: position of the i-th microphone; v_s: speed of sound

Delay and sum array processing
• Calibrate using cross-correlation analysis with a single-source presentation
• Compute delay and weight from the geometry of the array and target
  - delay: time of flight
  - weight: estimated SNR based on distance
• The filtered source is the delayed and weighted sum of all microphones.
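The delay-and-sum processing above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: function names are hypothetical, the speed of sound is fixed at 343 m/s, delays are rounded to whole samples, and the per-channel weights default to 1 rather than the SNR-based weights described on the slide.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # v_s in m/s (assumed constant here)

def delay_and_sum(signals, mic_positions, target, fs, weights=None):
    """Focus the array at `target`: advance each channel by its time of
    flight d_i(r) = ||r - r_i|| / v_s so wavefronts from the target
    align, then form the weighted sum y(t, r_target)."""
    signals = np.asarray(signals, dtype=float)      # (N mics, T samples)
    n_mics, n_samples = signals.shape
    weights = np.ones(n_mics) if weights is None else np.asarray(weights, float)
    delays = np.linalg.norm(mic_positions - target, axis=1) / SPEED_OF_SOUND
    # Integer-sample shifts relative to the earliest arrival.
    shifts = np.round((delays - delays.min()) * fs).astype(int)
    out = np.zeros(n_samples)
    for x, s, a in zip(signals, shifts, weights):
        out[:n_samples - s] += a * x[s:]
    return out / weights.sum()

def output_power(signals, mic_positions, target, fs):
    """Integral of y(t, r)^2 over the frame -- the steering objective
    maximized in the max-power search later in the talk."""
    y = delay_and_sum(signals, mic_positions, target, fs)
    return float(np.sum(y ** 2))
```

With an impulse source, focusing at the true target aligns the per-channel pulses so `output_power` peaks there; steering then amounts to maximizing this objective over candidate positions.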
Beamforming Example
  [figure: received signals, delayed signals, delayed-and-summed signal]

Array Size
• Beam width ∝ (array span)^-1
• Large arrays select fully bounded volumes
• Small arrays select directional beams

Related Work
• Small-aperture vision-guided microphone arrays (Bub, Hunke, and Waibel)
• Large-aperture audio-guided arrays (Silverman et al.)

Beamforming Demonstration
The first person moves on an oval path while counting; the second person is stationary while reciting the alphabet.
• Result from a single microphone at the center of the room
• Result from the microphone array with focus fixed at the initial position of the moving speaker
  [figure: output power (dB) vs. position (meters)]

Array Steering
• Audio-only -- max-power search:

    r_target = argmax_r ∫ y(t, r)² dt

• Audio-only steering is hard.
  [figure: array output power (dB) vs. position (meters)]

Hybrid Localization
• Vision-only steering isn't perfect:
  - joint calibration
  - person tracking, not mouth tracking
• Can correct the vision-based estimate with a limited search (implemented as gradient ascent) in the audio domain

System flow (single target)
Video streams -> vision-based tracker -> r_vision; audio streams + gradient ascent search in array output power -> r_av -> delay-and-sum beamformer -> y(t, r_av)

Results
Localization technique   SNR (dB)
Single microphone         -6.6
Video only                -4.4
Audio-video hybrid         2.3
• Audio examples: single microphone vs. hybrid tracking with beamforming

Status
• Fully 3-D, multimodal sound-source localization and separation system
• Real-time implementation of delay-and-sum array processing
• Future work:
  - compare to a commercial linear array
  - more sophisticated beamforming (null steering)
  - connect to automated speech recognition (in progress)
  - incorporate single-channel source separation techniques: AVMI, ICA, source modeling

Today
• Tracking speakers with an audio-visual microphone array
• Tracking faces for gaze-aware dialog interfaces [John Fisher]
• Speaker verification with higher-order joint audio-visual statistics

Brightness and depth motion constraints
  [figure: brightness images I_t, I_t+1 and depth images Z_t, Z_t+1]

Closed-loop 3D tracker
Track the user's head gaze for hands-free pointing...

Head-driven cursor
• Related projects: Schiele, Kjeldsen, Toyama
• Current application: second pointer, or scrolling / focus of attention
  [figure: task comparison -- single cursor, two hand cursors, head-hand cursors]

Gaze-aware dialog interface
• Interface agent responds to the gaze of the user
  - agent should know when it's being attended to
  - turn-taking pragmatics
  - anaphora / object reference
• Prototype
  - E21 interface "sam"
  - current experiments with a face tracker on the meeting-room table
• Wizard-of-Oz (WOZ) initial user tests
• Integrating with wall cameras and the hand-gesture interface

Is that you talking?
• New single-channel algorithm to prevent stray utterances: match video to audio!
  - audio-visual synchrony detection
  - analyze the mutual information between the signals
• Find a maximally informative subspace projection between audio and video... [Fisher and Darrell]

Perceptual context
Take-home message: vision provides perceptual context to make applications aware of users.
So far: detection, ID, head pose, audio enhancement, and synchrony verification.
Soon:
• gaze -- add eye tracking on the pose-stabilized face
• pointing -- arm gestures for selection and navigation
• activity -- adapting outdoor activity classification [Grimson and Stauffer] to the indoor domain
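The audio-visual synchrony idea can be illustrated in its simplest form. The actual system learns a maximally informative subspace projection between the audio and video signals [Fisher and Darrell]; the sketch below instead assumes scalar features (per-frame audio energy and motion energy, names hypothetical) and a joint Gaussian model, under which mutual information reduces to I = -0.5·log(1 - ρ²) for correlation ρ.

```python
import numpy as np

def gaussian_mutual_information(a, v):
    """MI (in nats) between two scalar time series under a joint
    Gaussian assumption: I(A; V) = -0.5 * log(1 - rho^2)."""
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    rho = np.corrcoef(a, v)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)

def synchrony_score(audio_energy, motion_energy):
    """Score the match between audio and video by comparing MI of the
    aligned features against a deliberately desynchronized (half-length
    circular shift) version; a true speaker should score higher."""
    aligned = gaussian_mutual_information(audio_energy, motion_energy)
    shifted = gaussian_mutual_information(
        np.roll(audio_energy, len(audio_energy) // 2), motion_energy)
    return aligned - shifted
```

A stray utterance from off-camera would leave the audio energy uncorrelated with the tracked face's motion, driving the score toward zero.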