Automatic audio analysis for content description & indexing

Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>
Outline
1. Auditory Scene Analysis (ASA)
2. Computational ASA (CASA)
3. Prediction-driven CASA
4. Speech recognition & sound mixtures
5. Implications for content analysis
1 Auditory Scene Analysis
“The organization of complex sound scenes according to their inferred sources”
• Sounds rarely occur in isolation
  - organization required for useful information
• Human audition is very effective
  - unexpectedly difficult to model
• ‘Correct’ analysis defined by goal
  - source shows independence, continuity
  → ecological constraints enable organization
[Figure: spectrogram of a city-street scene (“city22”), f/Hz 200–4000 vs. time 0–9 s, level −70 to −40 dB]
Psychology of ASA
• Extensive experimental research
  - organization of ‘simple pieces’ (sinusoids & white noise)
  - streaming, pitch perception, ‘double vowels’
• “Auditory Scene Analysis” [Bregman 1990] → grouping ‘rules’
  - common onset/offset/modulation, harmonicity, spatial location
  (a toy onset-grouping sketch follows below)
• Debated... (Darwin, Carlyon, Moore, Remez)
[Figure: after Darwin 1996]
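To make the ‘common onset’ rule concrete, here is a minimal Python/numpy sketch (not from the talk; the function name, STFT settings, and thresholds are all illustrative) that groups spectrogram channels whose energy onsets coincide in time:

```python
import numpy as np

def common_onset_groups(x, sr, n_fft=512, hop=128, thresh_db=12.0, tol=2):
    """Group frequency channels whose energy onsets coincide in time.

    A toy illustration of Bregman-style 'common onset' grouping:
    channels that start together are assigned to the same group.
    """
    # Short-time Fourier magnitude, frames x bins
    win = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(win * x[i:i + n_fft]))
              for i in range(0, len(x) - n_fft, hop)]
    S = 20 * np.log10(np.array(frames) + 1e-9)

    # Per-channel onset frame: first frame with a large dB jump
    rise = np.diff(S, axis=0)
    onset = np.argmax(rise > thresh_db, axis=0)   # 0 if channel never rises

    # Channels whose onsets fall within `tol` frames share a group label
    groups = {}
    for ch, t in enumerate(onset):
        groups.setdefault(t // tol, []).append(ch)
    return groups
```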
2 Computational Auditory Scene Analysis (CASA)
• Automatic sound organization?
  - convert an undifferentiated signal into a description in terms of different sources
[Figure: city-street spectrogram (“city22”) with labeled events: horn, horn, door crash, yell, car noise; f/Hz 200–4000 vs. time 0–9 s, level −70 to −40 dB]
• Translate psych. rules into programs?
  - representations to reveal common onset, harmonicity ...
• Motivations & Applications
  - it’s a puzzle: new processing principles?
  - real-world interactive systems (speech, robots)
  - hearing prostheses (enhancement, description)
  - advanced processing (remixing)
  - multimedia indexing...
CASA survey
• Early work on co-channel speech
  - listeners benefit from pitch difference
  - algorithms for separating periodicities
• Utterance-sized signals need more
  - cannot predict number of signals (0, 1, 2 ...)
  - birth/death processes
• Ultimately, more constraints needed
  - nonperiodic signals
  - masked cues
  - ambiguous signals
CASA1: Periodic pieces
• Weintraub 1985
  - separate male & female voices
  - find periodicities in each frequency channel by auto-coincidence
    (a per-channel periodicity sketch follows the figure below)
  - number of voices is ‘hidden state’
• Cooke & Brown (1991-3)
  - divide time-frequency plane into elements
  - apply grouping rules to form sources
  - pull single periodic target out of noise
[Figure: spectrograms before and after separation (“brn1h.aif”, “brn1h.fi.aif”), frq/Hz 100–3000 vs. time 0.2–1.0 s]
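The ‘auto-coincidence’ idea amounts to per-channel autocorrelation, as in correlogram-style CASA front ends. A minimal illustrative sketch, assuming a precomputed filterbank output (the gammatone filterbank itself is not shown, and the 0.5 periodicity threshold is arbitrary):

```python
import numpy as np

def channel_periodicities(channels, sr, fmin=80.0, fmax=400.0):
    """Estimate the dominant period in each frequency channel by
    autocorrelation ('auto-coincidence').

    `channels` is an (n_channels, n_samples) array, e.g. rectified
    gammatone filterbank outputs. Returns one pitch estimate in Hz
    per channel, or 0 where no clear periodicity is found.
    """
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    pitches = np.zeros(len(channels))
    for i, ch in enumerate(channels):
        ch = ch - ch.mean()
        # Autocorrelation via FFT of the zero-padded signal
        spec = np.fft.rfft(ch, n=2 * len(ch))
        ac = np.fft.irfft(spec * np.conj(spec))[:lag_max + 1]
        if ac[0] <= 0:
            continue                       # silent channel
        ac /= ac[0]                        # normalize so ac[0] == 1
        peak = lag_min + np.argmax(ac[lag_min:lag_max + 1])
        if ac[peak] > 0.5:                 # crude periodicity threshold
            pitches[i] = sr / peak
    return pitches
```

Channels sharing the same estimated period can then be grouped as one periodic source, which is the essence of pulling a single periodic target out of noise.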
CASA2: Hypothesis systems
• Okuno et al. (1994-)
  - ‘tracers’ follow each harmonic + noise ‘agent’
  - residue-driven: account for whole signal
• Klassner 1996
  - search for a combination of templates
  - high-level hypotheses permit front-end tuning
[Figure: spectral templates for Buzzer-Alarm, Glass-Clink, Phone-Ring, Siren-Chirp (420–3760 Hz) over 0–4 s; panels (a) and (b)]
• Ellis 1996
  - model for events perceived in dense scenes
  - prediction-driven: reconcile observations with hypotheses
CASA3: Other approaches
• Blind source separation (Bell & Sejnowski)
  - find exact separation parameters by maximizing a statistic, e.g. signal independence
  (see the toy sketch after this list)
• HMM decomposition (RK Moore)
  - recover combined source states directly
• Neural models (Malsburg, Wang & Brown)
  - avoid implausible AI methods (search, lists)
  - oscillators substitute for iteration?
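As a toy illustration of independence-maximizing separation: Bell & Sejnowski’s algorithm is Infomax ICA; the sketch below substitutes scikit-learn’s FastICA (assumed available, version ≥ 1.1 for the `whiten` argument) on a synthetic two-microphone instantaneous mixture. The assumptions (as many mics as sources, non-convolutive mixing) are exactly what limits this family of methods in real rooms.

```python
import numpy as np
from sklearn.decomposition import FastICA   # stand-in for Infomax ICA

def blind_separate(mixtures, n_sources):
    """Unmix instantaneous linear mixtures by maximizing independence.

    `mixtures` is an (n_mics, n_samples) array. Sources are recovered
    only up to scale and permutation, as with any ICA method.
    """
    ica = FastICA(n_components=n_sources, whiten="unit-variance")
    # scikit-learn expects (n_samples, n_features)
    return ica.fit_transform(mixtures.T).T

# Toy usage: two sources, two synthetic mixtures
if __name__ == "__main__":
    t = np.linspace(0, 1, 8000)
    s = np.vstack([np.sin(2 * np.pi * 220 * t),           # tone
                   np.sign(np.sin(2 * np.pi * 3 * t))])   # square wave
    A = np.array([[1.0, 0.6], [0.4, 1.0]])                # mixing matrix
    x = A @ s
    est = blind_separate(x, 2)
```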
3 Prediction-driven CASA
Perception is not direct, but a search for plausible hypotheses
• Data-driven:
  input mixture → Front end → signal features → Object formation → discrete objects → Grouping rules → Source groups
• vs. Prediction-driven:
  input mixture → Front end → signal features → Compare & reconcile → prediction errors → Hypothesis management → hypotheses (Noise components, Periodic components) → Predict & combine → predicted features → back to Compare & reconcile
• Motivations
  - detect non-tonal events (noise & clicks)
  - support ‘restoration illusions’...
  → hooks for high-level knowledge
  + ‘complete explanation’, multiple hypotheses, resynthesis
  (a sketch of the reconcile loop follows below)
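A schematic sketch of the predict/compare/reconcile loop. This is not Ellis’s actual implementation: the component class, its update rule, and the error threshold are all invented for illustration; only the loop structure follows the architecture above.

```python
import numpy as np

class NoiseComponent:
    """Minimal stationary-noise hypothesis: predicts a spectral envelope."""
    def __init__(self, seed):
        self.env = seed.astype(float)           # per-bin level in dB
    def predict(self):
        return self.env
    def update(self, obs):
        # Slowly track the observation where it exceeds our level
        self.env += 0.1 * np.maximum(obs - self.env, 0)

def pdcasa_loop(frames, hypotheses=None, err_thresh=6.0):
    """Skeleton of the prediction-driven analysis loop (illustrative only):
    predict & combine, compare & reconcile, then manage hypotheses."""
    hypotheses = list(hypotheses or [])
    for obs in frames:                          # obs: spectral frame in dB
        if hypotheses:
            # Predict & combine: sum component predictions in the power domain
            power = sum(10 ** (h.predict() / 10) for h in hypotheses)
            combined = 10 * np.log10(power + 1e-12)
        else:
            combined = np.full_like(obs, -120.0)
        # Compare & reconcile: prediction error drives hypothesis management
        error = obs - combined
        if np.max(error) > err_thresh:
            # Unexplained energy: create a new generic element to cover it
            hypotheses.append(NoiseComponent(np.maximum(obs, -120.0)))
        for h in hypotheses:
            h.update(obs)
    return hypotheses
```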
Analyzing the continuity illusion
• Interrupted tone heard as continuous
  - ... if the interruption could be a masker
[Figure: spectrogram of the interrupted-tone stimulus (“ptshort”), f/Hz 1000–4000 vs. time 0–1.4 s]
• Data-driven analysis just sees gaps
• Prediction-driven analysis can accommodate it
  - special case or general principle?
  (a stimulus-synthesis sketch follows below)
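The stimulus itself is easy to synthesize. A numpy sketch (all parameter values illustrative) producing the interrupted tone with or without a noise burst filling the gap:

```python
import numpy as np

def continuity_stimulus(sr=8000, f0=1000.0, dur=1.4,
                        gap=(0.6, 0.8), fill_noise=True):
    """Synthesize a tone with a gap, optionally filled by a noise burst.

    With fill_noise=True the burst can act as a plausible masker and
    listeners tend to hear the tone as continuing through it; with
    False the gap is plainly audible.
    """
    t = np.arange(int(sr * dur)) / sr
    tone = 0.3 * np.sin(2 * np.pi * f0 * t)
    in_gap = (t >= gap[0]) & (t < gap[1])
    tone[in_gap] = 0.0                          # interrupt the tone
    if fill_noise:
        tone[in_gap] += 0.5 * np.random.randn(in_gap.sum())
    return tone
```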
Phonemic Restoration (Warren 1970)
• Another ‘illusion’ instance
• Inference relies on high-level semantics
[Figure: spectrogram of noise-interrupted speech (“nsoffee.aif”), frq/Hz 0–3500 vs. time 1.2–1.7 s]
• Incorporating knowledge into models?
Subjective ground-truth in mixtures?
• Listening tests collect ‘perceived events’
• Consistent answers:
[Figure: city-street spectrogram (f/Hz 200–4000, 0–9 s) with listener responses aligned to events]
  - Horn1 (10/10): S9 “horn 2”, S10 “car horn”, S4 “horn1”, S6 “double horn”, S2 “first double horn”, S7 “horn”, S7 “horn2”, S3 “1st horn”, S5 “Honk”, S8 “car horns”, S1 “honk, honk”
  - Crash (10/10): S7 “gunshot”, S8 “large object crash”, S6 “slam”, S9 “door Slam?”, S2 “crash”, S4 “crash”, S10 “door slamming”, S5 “Trash can”, S3 “crash (not car)”, S1 “slam”
  - Horn2 (5/10): S9 “horn 5”, S8 “car horns”, S2 “horn during crash”, S6 “doppler horn”, S7 “horn3”
  - Truck (7/10): S8 “truck engine”, S2 “truck accelerating”, S5 “Acceleration”, S1 “rev up/passing”, S6 “acceleration”, S3 “closeup car”, S10 “wheels on road”
PDCASA example: City-street ambience
[Figure: PDCASA analysis of the city-street ambience, 0–9 s, levels −70 to −40 dB: weft (pitched) elements Wefts1−4, Weft5, Wefts6,7, Weft8, Wefts9−12 aligned with Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10); noise/click elements Noise2, Click1 for Crash (10/10); Noise1 for Squeal (6/10) and Truck (7/10)]
• Problems
  - error allocation
  - source hierarchy
  - rating hypotheses
  - resynthesis
4 Speech recognition & sound mixtures
• Conventional speech recognition:
  signal → Feature extraction → low-dim. features → Phoneme classifier → phoneme probabilities → HMM decoder → words
  (a minimal decoder sketch follows this list)
  - signal assumed entirely speech
  - find valid labelling by discrete labels
  - class models from training data
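The final stage of the pipeline can be illustrated with a minimal Viterbi pass over frame-wise phoneme log-probabilities. This is a generic textbook sketch, not the specific recognizer behind these experiments:

```python
import numpy as np

def viterbi_decode(log_probs, log_trans, log_init):
    """Find the most likely HMM state sequence.

    log_probs : (T, S) frame-wise phoneme log-probabilities
                (e.g. from a neural-network classifier)
    log_trans : (S, S) state transition log-probabilities
    log_init  : (S,)  initial state log-probabilities
    """
    T, S = log_probs.shape
    delta = log_init + log_probs[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)       # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_trans    # score of (prev -> next)
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(S)] + log_probs[t]
    # Trace back the best path
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```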
• Some problems:
  - need to ignore lexically-irrelevant variation (microphone, voice pitch etc.)
  - compact feature space → everything speech-like
• Very fragile to nonspeech, background
  - scene-analysis methods very attractive...
CASA for speech recognition
• Data-driven: CASA as preprocessor
  - problems with ‘holes’ (but: Cooke, Okuno)
  - doesn’t exploit knowledge of speech structure
• Prediction-driven: speech as a component
  - same ‘reconciliation’ of speech hypotheses
  - need to express ‘predictions’ in the signal domain
[Diagram: the prediction-driven architecture extended with speech: input mixture → Front end → Compare & reconcile ↔ Predict & combine, with Hypothesis management over Speech components, Periodic components, and Noise components]
Example of speech & nonspeech
[Figure: auditory (Bark-scale) spectrograms, 0–1.5 s: (a) Clap (clap8k−env.pf); (b) Speech plus clap (223cl−env.pf); (c) Recognizer output, phone labels with the words “five”, “oh”, “nine”, “two”, “seven”; (d) Reconstruction from labels alone (223cl−renvG.pf); (e) Slowly-varying portion of original (223cl−envg.pf); (f) Predicted speech element (= (d)+(e)) (223cl−renv.pf); (g) Click5 from nonspeech analysis (justclick.pf); (h) Spurious elements from nonspeech analysis (nonclicks.pf)]
• Problems:
  - undoing classification & normalization
  - finding a starting hypothesis
  - granularity of integration
5 Implications for content analysis: Using CASA to index soundtracks
[Figure: city-street spectrogram (“city22”) with labeled events: horn, horn, door crash, yell, car noise; f/Hz 200–4000 vs. time 0–9 s, level −70 to −40 dB]
• What are the ‘objects’ in a soundtrack?
  - subjective definition → need auditory model
• Segmentation vs. classification
  - low-level cues → locate events
  - higher-level ‘learned’ knowledge to give semantic labels (footstep, crash)
  ... AI-complete?
• But: hard to separate
  - illusion phenomena suggest auditory organization depends on interpretation
Using speech recognition for indexing
• Active research area: access to news broadcast databases
  - e.g. Informedia (CMU), ThisL (BBC+...)
  - use LV-CSR to transcribe, then text retrieval to find
  - 30-40% word error rate, still works OK
  (a toy transcript-indexing sketch follows this list)
• Several systems at NIST TREC workshop
• Tricks to ‘ignore’ nonspeech/poor speech
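The transcribe-then-retrieve scheme reduces to ordinary text indexing over time-stamped recognizer output. A toy sketch, with a hypothetical transcript layout (these systems’ actual data structures are not shown here); redundancy across query terms is what lets retrieval survive a 30-40% word error rate:

```python
from collections import defaultdict

def build_index(transcripts):
    """Word-level inverted index over recognizer transcripts.

    `transcripts` maps a recording id to a list of (time_s, word)
    pairs, as might come from an LV-CSR decoder (hypothetical layout).
    """
    index = defaultdict(list)
    for rec_id, words in transcripts.items():
        for time_s, word in words:
            index[word.lower()].append((rec_id, time_s))
    return index

def query(index, terms):
    """Return (rec_id, time_s) hits for any query term."""
    hits = []
    for term in terms:
        hits.extend(index.get(term.lower(), []))
    return sorted(hits)

# Hypothetical usage:
idx = build_index({"news01": [(12.3, "election"), (13.0, "results")]})
print(query(idx, ["election"]))   # -> [('news01', 12.3)]
```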
Open issues in automatic indexing
• How to do ASA?
• Explanation/description hierarchy
  - PDCASA: ‘generic’ primitives + constraining hierarchy
  - subjective & task-dependent
• Classification
  - connecting subjective & objective properties
  → finding subjective invariants, prominence
  - representation of sound-object ‘classes’
• Resynthesis?
  - a ‘good’ description should be adequate
  - provided in PDCASA, but low quality
  - requires good knowledge-based constraints
6 Conclusions
• Auditory organization is required in real environments
• We don’t know how listeners do it!
  - plenty of modeling interest
• Prediction-reconciliation can account for ‘illusions’
  - use ‘knowledge’ when the signal is inadequate
  - important in a wider range of circumstances?
• Speech recognizers are a good source of knowledge
• Automatic indexing implies a ‘synthetic listener’
  - need to solve a lot of modeling issues