Brief Presentation of ICMI '03
N. Oliver & E. Horvitz paper
Nikolaos Mavridis, Feb '02
The menu for today:
An application that served as testbed & excuse
The architecture of recognition engines used
Two varieties of selective perception
Results
Big Ideas
An intro to RESOLVER
The main big idea:
NO NEED TO NOTICE AND PROCESS
EVERYTHING ALWAYS!
SEER:
A multimodal system for recognizing office activity
General setting:
A basic requirement for visual surveillance and multimodal HCI is the provision of rich, human-centric notions of context in a tractable manner …
Prior work: mainly particular scenarios (waving the hand etc.), HMMs, DynBNs
Output Categories:
PC=Phone Conversation
FFC=Face2Face Conversation
P=Presentation
O=Other Activity
NP=Nobody Present
DC=Distant Conversation (out of field of view)
Input:
Audio: PCA of LPC coeffs, energy, μ and σ of ω0 (fundamental frequency), zero-crossing rate
Audio Localisation: Time Delay of Arrival (TDOA)
Video: skin color, motion, foreground and face densities
Mouse & Keyboard: History of 1,5 and 60sec of activity
Recognition engine: LHMM (Layered!)
First level:
Parallel discriminative HMMs for categories:
Audio: human speech, music, silence, noise, ring, keyboard
Video: nobody, static person, moving person, multiperson
Second level:
Input: outputs of the above + derivative of sound localisation + keyboard histories
Output: PC, FFC, P, O, NP, DC – longer temporal extent!
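A minimal sketch of the layered idea (mine, not the authors' implementation): each first-level HMM scores a short window of its modality with the forward algorithm, and the resulting bank of log-likelihoods becomes part of the second-level observation vector. Discrete observations and all model parameters are assumed placeholders:

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under one HMM
    (scaled forward algorithm); pi: (S,), A: (S,S), B: (S,V)."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

def first_level_outputs(audio_obs, video_obs, audio_models, video_models):
    """Run the parallel per-class HMMs; their log-likelihoods are the
    (inferential) features handed up to the second-level HMMs."""
    return np.array([forward_loglik(audio_obs, *m) for m in audio_models]
                    + [forward_loglik(video_obs, *m) for m in video_models])
```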
Selective Perception Strategies usable for both levels!
Selecting which features to use at the input of the HMMs!
Example:
Motion & skin density for one active person
Skin density & face detection for multiple people
Also for the second stage: selecting which first-stage HMMs to run…
HMMs vs LHMMs
Compared to CP HMMs (Cartesian product: one long concatenated feature vector; see the rough count below)
Prior knowledge about the problem is encoded in the structure of the LHMMs,
i.e. decomposition into smaller subproblems -> less training required, more filtered output for the second stage, only the first level needs retraining!
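A back-of-the-envelope count (my numbers, purely illustrative) of why the decomposition pays off: if one flat HMM had to model the joint audio-video dynamics with states taken as the Cartesian product of, say, the 6 audio and 4 video classes above, the transition matrix alone would dwarf the layered alternative:

```latex
% Transition-parameter count: flat Cartesian-product HMM vs. two layered HMMs
(S_a S_v)^2 = (6 \cdot 4)^2 = 576
\qquad \text{vs.} \qquad
S_a^2 + S_v^2 = 6^2 + 4^2 = 52
```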
Why sense everything and compute everything always?!?
Two approaches:
EVI: Expected Value of Information (à la RESOLVER)
Decision theory and uncertainty reduction
EVI computed for different overlapping subsets, in real time, every frame
Greedy, one-step-lookahead approach for computing the next best set of observations to evaluate
Rate-based perception (somewhat similar to RIP BEHAVIOR)
Policies defined heuristically for specifying observational frequencies and duty cycles for each computed feature
Two baselines for comparison:
Compute everything!
Randomly select feature subsets
Endowing the perceptual system with knowledge of the value of action in the world …
EV(f_k) = \sum_m P(f_k^m \mid E) \; \max_i \sum_j P(M_j \mid E, f_k^m) \, U(M_i, M_j)

where:
f_k : the k-th feature subset (k = 1 … K; e.g. for 4 features, K = 16 subsets)
f_k^m : the possible outcomes of f_k (m = 1 … M_k; e.g. for all four features above, with binary outcomes, M_k = 16)
E : all previous observational Evidence
P(f_k^m | E) : probability of outcome f_k^m given the evidence
P(M_j | E, f_k^m) : probability of activity M_j given the evidence and outcome f_k^m
U(M_i, M_j) : utility of asserting activity M_i when the ground truth is M_j
max_i Σ_j P(M_j | E, f_k^m) U(M_i, M_j) : expected utility of the best assertion M_i, given outcome f_k^m
But what we are really interested in is what we have to gain! Thus:
EVI(f_k) = EV(f_k) - \max_i \sum_j P(M_j \mid E) \, U(M_i, M_j) - \mathrm{cost}(f_k)

Where we also account for:
What we would get given no sensing at all (the subtracted max term: the best assertion from the prior evidence E alone)
Cost of sensing – but we have to map cost and utility to the same currency!
HMM-ised implementation used!
Richer cost models:
Non-identity U matrix
Constant vs. activity-dependent costs (what else is running?) – successful results! (no significant decrease in accuracy ;-))
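As an illustration of the two formulas above, a minimal Python sketch (mine, not the authors' code) that enumerates the discrete outcomes of a candidate feature subset and scores it greedily; all probabilities, the utility matrix U and the cost value are assumed inputs:

```python
import numpy as np

def ev(p_outcome, p_act_given_outcome, U):
    """EV(f_k) = sum_m P(f_k^m|E) max_i sum_j P(M_j|E,f_k^m) U(M_i,M_j).
    p_outcome: (M,) outcome probabilities; p_act_given_outcome: (M, J);
    U: (I, J) utility of asserting activity i when activity j is true."""
    best_per_outcome = (p_act_given_outcome @ U.T).max(axis=1)   # (M,)
    return float(p_outcome @ best_per_outcome)

def evi(p_outcome, p_act_given_outcome, p_act_prior, U, cost):
    """EVI(f_k) = EV(f_k) - [best expected utility from E alone] - cost(f_k)."""
    baseline = float((U @ p_act_prior).max())   # max_i sum_j P(M_j|E) U(M_i,M_j)
    return ev(p_outcome, p_act_given_outcome, U) - baseline - cost

def next_best_subset(candidates):
    """Greedy one-step lookahead: score every candidate feature subset by its
    EVI and pick the best; sense nothing this frame if all EVIs are negative."""
    scores = {name: evi(*args) for name, args in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```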
Rate-based perception: a simple idea
But in this case, no online tuning of rates …
Doesn't capture sequential prerequisites etc.
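A toy version of such a policy (all numbers invented): each feature is assigned a fixed period and duty cycle up front, so there is nothing to tune online and no way to express "run face detection only after motion fires":

```python
# Hypothetical rate-based schedule: (period, duty) in frames per feature,
# fixed heuristically in advance rather than tuned online.
SCHEDULE = {
    "motion": (2, 2),        # always on, cheap
    "skin_color": (4, 2),    # on for 2 frames out of every 4
    "face_detect": (8, 1),   # expensive: 1 frame in 8
    "sound_loc": (4, 4),     # always on
}

def features_to_compute(frame_idx: int) -> list:
    """A feature is evaluated only during the 'on' part of its duty cycle."""
    return [f for f, (period, duty) in SCHEDULE.items()
            if frame_idx % period < duty]
```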
EVI: No significant performance decrease with much less computational cost!
Also effective in activity-dependent mode.
And even more to be gained!
No need to sense & compute everything always!
In essence we have a Planner :
a planner for goal-based sensing and cognition!
Not only useful for AI:
Approach might be useful for computational modeling of human performance, too …
Simple satisficing works:
No need for fully-optimised planning; with some precautions, one-step lookahead with many approximations is sufficient –
ALSO more plausible for humans! (ref: Ullman)
Easy co-existence with other goal-based modules:
We just need a method for distributing time-varying costs of sensing and cognising actions (centralised stock market?)
As a future direction: time-decreasing confidence is mentioned