Sound content analysis for indexing and understanding
Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>
Outline
1. Sound content analysis
2. Speech recognition
3. Auditory scene analysis
4. Audio content indexing
5. Conclusions
Sound content analysis
• Overall goal: ‘useful’ data from sound

  [Spectrogram example: 0-4000 Hz over ~12 s]

  - which depends on the goal

• Involving:
  - continuous → discrete
  - source separation
  - extracting ‘semantic’ content (words, actions/events)
The space of sound analysis research
  [Map of the field on two axes, abstract↔concrete and engineering↔perception:
   speech recognition, computational auditory scene analysis, content-based
   retrieval, coding/compression, and auditory modeling each occupy a
   different region of the space]
Outline
1. Sound content analysis
2. Speech recognition
   - Classic speech recognition
   - The connectionist-HMM hybrid
   - Strength through combinations
3. Auditory scene analysis
4. Audio content indexing
5. Conclusions
Speech recognition: Dictation
• Observations X = {X1..XN} → states S = {S1..SN}
  (Markov assumption):

    S* = argmax_S P(S|X)
       = argmax_S P(S,X) / P(X)
       = argmax_S ∏i P(Xi|Si) · P(Si|Si-1)
                   (acoustic prob.) (transition prob.)

• The state sequence {Si} (e.g. phones) defines words
  [Block diagram: sound → feature calculation → feature vectors → acoustic
   classifier → phone likelihoods → HMM decoder → words; the decoder draws
   on word models (e.g. "sat" = s-ah-t) and a language model
   (p("sat"|"the","cat") vs. p("saw"|"the","cat"))]

• Training (on large datasets) is the key
  - EM iteration for acoustic & transition probs.
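The argmax above is found efficiently by Viterbi dynamic programming. A minimal log-domain sketch in Python (a toy illustration; the state inventory, likelihoods and transitions are placeholders, not from this talk):

    import numpy as np

    def viterbi(log_lik, log_trans, log_init):
        # log_lik:   (N, Q) log P(X_i | S_i) from the acoustic model
        # log_trans: (Q, Q) log P(S_i | S_i-1)
        # log_init:  (Q,)   log P(S_1)
        N, Q = log_lik.shape
        delta = log_init + log_lik[0]            # best log-prob ending in each state
        back = np.zeros((N, Q), dtype=int)       # backpointers
        for i in range(1, N):
            scores = delta[:, None] + log_trans  # scores[prev, cur]
            back[i] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_lik[i]
        states = [int(delta.argmax())]           # trace back the best sequence
        for i in range(N - 1, 0, -1):
            states.append(int(back[i, states[-1]]))
        return states[::-1]                      # S* = argmax_S P(S, X)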
The connectionist-HMM hybrid
(Morgan & Bourlard, 1995)
• P(Xi|Si) is the acoustic likelihood model
  - model the distribution with, e.g., Gaussian mixtures

• Replace with the posterior, P(Si|Xi):
  a neural network acoustic classifier
  [Diagram: sound → feature calculation → neural network (input context
   window C0..Ck over frames tn..tn+w) → posterior probabilities for
   context-independent phone classes (h#, pcl, bcl, tcl, dcl, ...) →
   HMM decoder with word models & language model → words]

  - the neural network estimates the phone given the acoustics
  - discriminative
• Simpler structure for research
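One practical detail: the HMM decoder wants likelihoods P(Xi|Si), so hybrid systems divide the network posteriors by the class priors to obtain 'scaled likelihoods'. A minimal sketch, assuming posteriors and priors are already estimated:

    import numpy as np

    def scaled_log_likelihoods(log_post, log_prior):
        # P(X|S) = P(S|X) P(X) / P(S); P(X) is common to all states within
        # a frame, so P(S|X)/P(S) can stand in for the acoustic likelihood.
        # log_post: (frames, phones), log_prior: (phones,)
        return log_post - log_prior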
Visualizing speech recognition
  [Pipeline: sound → feature calculation → feature vectors → acoustic
   classifier (network weights) → phone probabilities → HMM decoder
   (word models, language model) → phone & word labeling]
Combination schemes
• How to use complementary features? Three schemes:

  [Diagrams:
   1. Feature combination ("^"): feature 1 & feature 2 calculations →
      feature combination → one acoustic classifier → phone probabilities →
      HMM decoder → word hypotheses
   2. Posterior combination ("•"): an acoustic classifier per stream →
      combine phone probabilities → HMM decoder → word hypotheses
   3. Hypothesis combination: a classifier and HMM decoder per stream →
      hypothesis combination → final hypotheses]
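A toy illustration of the first two schemes (the stream dimensions and the product-of-posteriors merging rule are illustrative assumptions, not the exact recipe evaluated here):

    import numpy as np

    # two hypothetical feature streams over the same 100 frames
    plp = np.random.randn(100, 13)
    msg = np.random.randn(100, 28)

    # 1. feature combination "^": concatenate, then train one classifier
    features_cat = np.concatenate([plp, msg], axis=1)    # (100, 41)

    # 2. posterior combination "•": one classifier per stream, then merge
    #    the per-frame phone posteriors, e.g. by a log-domain product
    def combine_posteriors(p1, p2):
        logp = np.log(p1 + 1e-12) + np.log(p2 + 1e-12)
        logp -= logp.max(axis=1, keepdims=True)          # stabilize
        p = np.exp(logp)
        return p / p.sum(axis=1, keepdims=True)          # renormalize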
Combining feature streams
• How to allocate feature dimensions to models?
  - lower-dimension models train more quickly
  - higher-dimension models find more interactions

• PLP & MSG for Aurora (digits in noise):
  - PLP are ‘conventional’ features
  - MSG was developed as a robust alternative
  - evaluate by word-error rate (WER) relative to the default baseline:
  Features                    Parameters   Baseline WER ratio
  plp12•dplp12                136k         97.6%
  plp12^dplp12                124k         89.6%
  msg3a•msg3b                 145k         101.1%
  msg3a^msg3b                 133k         85.8%
  plp12•dplp12•msg3a•msg3b    281k         76.5%
  plp12^dplp12^msg3a^msg3b    245k         74.1%
  plp12^dplp12•msg3a^msg3b    257k         63.0%
Tandem connectionist models
(with Hermansky et al., OGI)
• How can we combine neural net & GM models?

  [Diagram: input sound → feature calculation → speech features → neural
   net model; its pre-nonlinearity outputs → PCA orthogonalization →
   orthogonal features → HTK GM model → subword likelihoods → HTK decoder →
   tandem system output. (The net's posterior phone probabilities →
   posterior decoder would be the hybrid system output.)]

  - (the GMM system does not know they are phones)
• Result: better performance than either alone!
  - the neural net has trained discriminatively
  - the GMM HMMs learn context-dependent structure
  → they extract complementary info from the training data
  System-features    Baseline WER ratio
  HTK-mfcc           100.0%
  Hybrid-mfcc        84.6%
  Tandem-mfcc        64.5%
  Tandem-plp+msg     47.2%
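The tandem feature path is straightforward to sketch: take the net's pre-nonlinearity outputs, decorrelate them with PCA, and feed them to the GMM system as ordinary features. A minimal sketch (the retained dimensionality and the SVD-based PCA are illustrative choices):

    import numpy as np

    def tandem_features(pre_softmax, n_keep=24):
        # pre_softmax: (frames, phones) net outputs before the nonlinearity
        x = pre_softmax - pre_softmax.mean(axis=0)        # center each dimension
        _, _, vt = np.linalg.svd(x, full_matrices=False)  # PCA by SVD
        return x @ vt[:n_keep].T                          # (frames, n_keep) features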
Aurora “Distributed SR” evaluation
• 7 telecoms company submissions:
  [Bar chart: "Aurora DSR Evaluation 1999 Results" - improvement over the
   baseline in avg. WER (20-0 dB) for Baseline, S1-S6, Tandem1 and Tandem2]
- Tandem systems from OGI-ICSI-Qualcomm
Outstanding issues in speech recognition
• Are we on the right path?
  - useful dictation products exist
  - evaluation results improve every year
  - .. but appear to be asymptoting

• Is dictation enough?
  - a useful focus initially
  - .. but not speech understanding
  - .. and has skewed research

• What should be our research priorities?
  - straight ASR research is hard to fund
  - need to look at harder domains
  - need to connect it to applications
Outline
1. Sound content analysis
2. Speech recognition
3. Auditory scene analysis
   - Psychological phenomena
   - Computational modeling
   - Prediction-driven analysis
   - Incorporating speech
4. Audio content indexing
5. Conclusions
Auditory Scene Analysis (ASA)
“The organization of sound scenes
according to their inferred sources”
  [Spectrogram of a city-street scene (‘city22’): 200-4000 Hz over ~9 s,
   levels -70 to -40 dB]
• Sounds rarely occur in isolation
  - need to ‘separate’ them to get useful information

• Human audition is very effective
  - computational models have a lot to learn
Psychology of ASA
• Extensive experimental research
  - perception of simplified stimuli (sinusoids, noise)

• “Auditory Scene Analysis” [Bregman 1990]
  - first: break the mixture into small elements
  - elements are grouped into sources using cues

• Grouping ‘rules’ (Darwin, Carlyon, ...):
  - common onset/offset/modulation, harmonicity, spatial location, ...
  - relate to intrinsic (ecological) regularities
  [Diagram (after Darwin, 1996): frequency analysis feeds onset,
   harmonicity and position maps; a grouping mechanism combines them into
   source properties]
Computational Auditory Scene Analysis
(CASA)
• Literal model of Bregman... (e.g. Brown 1992):

  [Pipeline: input mixture → front end → signal features (maps of freq vs.
   time: onset, period, frq.mod) → object formation → discrete objects →
   grouping rules → source groups]
• Goals:
  - identify and segregate the different sources
  - resynthesize separate outputs!
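Since periodicity is the workhorse grouping cue, here is a toy autocorrelation-based period detector (an illustration only; the pitch range and the salience measure are assumptions, not Brown's actual front end):

    import numpy as np

    def periodicity_cue(frame, sr=16000, fmin=80.0, fmax=400.0):
        # return (best f0 in Hz, salience in 0..1) for one signal frame
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        ac = ac / (ac[0] + 1e-12)                # normalized autocorrelation
        lo, hi = int(sr / fmax), int(sr / fmin)  # candidate lag range
        lag = lo + int(ac[lo:hi].argmax())
        return sr / lag, float(ac[lag])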
Grouping model results
• Able to extract voiced speech:

  [Paired spectrograms (100-3000 Hz, ~1 s): the original mixture
   ‘brn1h.aif’ and the extracted voiced speech ‘brn1h.fi.aif’]
• Limitations:
  - resynthesis via a filter-mask
  - only periodic targets
  - robustness of discrete objects
Context, expectations & predictions
Perception is not direct
but a search for plausible hypotheses
• Bregman’s “old-plus-new” principle:
  a change in a signal will be interpreted as an added source whenever
  possible
• E.g. the ‘continuity illusion’:

  [Spectrogram ‘ptshort’ (0-4 kHz, ~1.4 s): a tone alternating with noise
   bursts]
  - a tone alternates with noise bursts
  - the noise is strong enough to mask the tone
    ... so the listener cannot tell whether the tone is actually present
  - a continuous tone is perceived across gaps of ~100s of ms
→ Inference acts at low, preconscious level
Modeling top-down processing:
‘Prediction-driven’ CASA (PDCASA)
• Data-driven:

  [Pipeline: input mixture → front end → signal features → object
   formation → discrete objects → grouping rules → source groups]

  vs. prediction-driven:

  [Loop: input mixture → front end → signal features → compare & reconcile
   against predicted features; predictions come from noise and periodic
   components via predict & combine; prediction errors drive hypothesis
   management, which updates the hypotheses]
• PDCASA key features:
  - ‘complete explanation’ of all scene energy
  - a vocabulary of periodic/noise/transient elements
  - multiple hypotheses
  - an explanation hierarchy
PDCASA for the continuity illusion
• Subjects hear the tone as continuous
  ... if the noise is a plausible masker
  [Spectrogram ‘ptshort’ again: the alternating tone and noise bursts]
• Data-driven analysis gives just the visible portions of the tone
• Prediction-driven analysis can infer the masking
PDCASA analysis of a complex scene
  [PDCASA analysis of the ~10 s city-street scene (200-4000 Hz, -70 to
   -40 dB): the energy is explained as periodic ‘wefts’ 1-12 plus noise and
   click elements, labeled by listeners as Horn1 (10/10), Horn2 (5/10),
   Horn3 (5/10), Horn4 (8/10), Horn5 (10/10), Crash (10/10), Squeal (6/10)
   and Truck (7/10)]
CASA for speech recognition
• Data-driven: CASA as a preprocessor
  - problems with ‘holes’ in the separated signal (but: Okuno)
  - doesn’t exploit knowledge of speech structure

• Missing data (Cooke &c, de Cheveigné) - a sketch follows the diagram below
  - CASA cues distinguish present/absent spectro-temporal regions
  - RESPITE project: modifications to the recognizer

• Prediction-driven: speech as a component
  - the same ‘reconciliation’ applies to speech hypotheses
  - need to express speech ‘predictions’ in the signal domain
  [Loop as before, with speech components added alongside the noise and
   periodic components: input mixture → front end → compare & reconcile
   against predict & combine; hypothesis management now includes speech
   hypotheses]
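The missing-data idea can be made concrete: with, say, a diagonal-Gaussian acoustic model, marginalizing out the feature dimensions the CASA mask flags as absent just means scoring the present ones. A minimal sketch under that assumption (not the actual RESPITE recognizer):

    import numpy as np

    def missing_data_loglik(x, present, mean, var):
        # x: (D,) features; present: (D,) boolean mask from CASA cues
        # mean, var: (D,) diagonal-Gaussian state parameters.
        # Marginalizing a diagonal Gaussian over the missing dimensions
        # simply drops them from the product.
        d = x[present] - mean[present]
        v = var[present]
        return -0.5 * np.sum(d * d / v + np.log(2 * np.pi * v))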
Other signal-separation approaches
• HMM decomposition (R.K. Moore ’86)
  - recover the combined source states directly

  [Diagram: a 2-D state lattice over model 1 × model 2, jointly decoded
   against the observations]
• Blind source separation (Bell & Sejnowski ’94)
  - find exact separation parameters by maximizing a statistic,
    e.g. signal independence
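A toy example of independence-maximizing separation, using scikit-learn's FastICA (a different ICA algorithm from Bell & Sejnowski's Infomax rule; the sources and mixing matrix here are invented):

    import numpy as np
    from sklearn.decomposition import FastICA

    t = np.linspace(0, 1, 8000)
    s1 = np.sin(2 * np.pi * 440 * t)             # tone
    s2 = np.sign(np.sin(2 * np.pi * 97 * t))     # buzz
    S = np.c_[s1, s2]                            # true sources
    A = np.array([[1.0, 0.6], [0.4, 1.0]])       # unknown mixing matrix
    X = S @ A.T                                  # two observed mixtures

    # recover the sources (up to order/scale) by maximizing independence
    unmixed = FastICA(n_components=2, random_state=0).fit_transform(X)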
Outstanding issues in CASA
• What is the architecture?
  - data-driven versus prediction-driven
  - representations at different levels
  - hypothesis search

• How to combine different cues?
  - priority of different cues
  - resolving conflicting cues
  - bottom-up versus top-down

• How to exploit training data?
  - .. the big lesson from speech recognition

• Evaluation
  - .. a more subtle lesson
Outline
1. Sound content analysis
2. Speech recognition
3. Auditory scene analysis
4. Audio content indexing
   - Spoken document retrieval
   - Handling nonspeech audio
   - Object-based analysis and retrieval
   - Audio-video content organization
5. Conclusions
Audio content indexing:
Spoken document retrieval (SDR)
• Idea: use speech recognition transcripts as indexes

• The best broadcast news systems are not great
  - 15-30% WER on real broadcasts

• Word errors vary in their impact:
THE VERY EARLY RETURNS OF THE NICARAGUAN PRESIDENTIAL ELECTION
SEEMED TO FADE BEFORE THE LOCAL MAYOR ON A LOT OF LAW
AT THIS STAGE OF THE ACCOUNTING FOR SEVENTY SCOTCH ONE LEADER
DANIEL ORTEGA IS IN SECOND PLACE THERE WERE TWENTY THREE
PRESIDENTIAL CANDIDATES OF THE ELECTION
THE LABOR MIGHT DO WELL TO REMEMBER THE LOST A MAJOR EPISODE OF
TRANSATLANTIC CONNECT TO A CORPORATION IN BOTH CONSERVATIVE PARTY
OFFICIALS FROM BRITAIN GOING TO WASHINGTON THEY WENT TO WOOD BUYS
GEORGE BUSH ON HOW TO WIN A SECOND TO NONE IN LONDON THIS IS
STEPHEN BEARD FOR MARKETPLACE
• Good enough for information retrieval (IR)
  - e.g. TREC-8 average precision:
      reference transcript   ~ 0.5
      30% WER transcript     ~ 0.4
Thematic Indexing of Spoken Language
(with Sheffield, Cambridge, BBC)
• SDR for the BBC broadcast news archive
  - 1000+ hour archive, automatically updated
Speech and nonspeech
(with Gethin Williams)
• Can ASR be run over entire soundtracks?
  - for nonspeech, the result is nonsense

• Watch the behavior of the speech acoustic model:
  - average per-frame entropy
  - ‘dynamism’: mean-squared 1st-order difference of the posteriors
    (see the sketch below)
  [Scatter plot: dynamism vs. entropy for 2.5-second segments; speech,
   music and speech+music segments occupy distinct regions. Inset:
   spectrogram and per-frame posteriors for a speech / music /
   speech+music example]
• 1.3% error on a 2.5-second speech/music test set
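Both statistics come directly from the classifier's per-frame phone posteriors. A minimal sketch (the exact averaging conventions are guesses):

    import numpy as np

    def entropy_and_dynamism(post):
        # post: (frames, phones) per-frame posterior probabilities
        p = np.clip(post, 1e-12, 1.0)
        entropy = float(-(p * np.log(p)).sum(axis=1).mean())   # avg per-frame entropy
        dynamism = float((np.diff(post, axis=0) ** 2).mean())  # mean-sq 1st difference
        return entropy, dynamism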
Element-based audio indexing
• Search for nonspeech audio databases
  - e.g. Muscle Fish ‘SoundFisher’ for SFX libraries

• Segment-level features:
  [Diagram: a sound segment database and a query example each pass through
   segment feature analysis; the resulting feature vectors go to
   search/comparison, which returns results]

  - well-performing features: spectral centroid, dynamics, tonality, ...
• Each segment is one object
  - not applicable to continuous recordings
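For instance, a segment's spectral centroid takes only a few lines (a sketch; real systems frame the signal and average, and the other features have analogous definitions):

    import numpy as np

    def spectral_centroid(x, sr):
        # magnitude-weighted mean frequency of the segment, in Hz
        mag = np.abs(np.fft.rfft(x))
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
        return float((freqs * mag).sum() / (mag.sum() + 1e-12))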
Object-based audio indexing
• Using ‘generic sound elements’
  - decompose sound into elements; match subsets
  - how to generalize?
  - how to use segment-style features?

• Form elements into objects for higher-order properties
  - CASA-type object formation (onset, harmonicity)
  [Diagram: a continuous audio archive passes through generic element
   analysis (element representations) and object formation (objects +
   properties) into search/comparison, which returns results; a query
   example takes the same path, while a symbolic query (“rushing water”)
   reaches the comparison via a word-to-class mapping, as properties alone]
Audio-video organization & retrieval
• How it might work...
  [System diagram: audio-video data is analyzed by speech processing,
   nonspeech analysis (audio) and video features, with boundaries &
   structuring; symbolic indexing terms feed an IR engine, while an
   analogic A-V match compares the query to the data directly;
   summarization presents the results]
AV indexing components
• Recovering broad temporal structure
  - speaker turns; speech & music; repetition
  - characteristic of genres, e.g. news shows
  - indexable attributes in themselves

• Posing queries:
  - term-based
  - proximity to examples
  - dynamic audio-visual sketches?

• How to define index/query terms?
  - different kinds of terms: literal versus thematic
  - machine learning of event classes

• Summarization
  - for displaying ‘hits’: impacts usability
  - text / image / video / sound
  - tricks, e.g. to find the most salient words
Open issues in audio indexing
• Information from speech
  - multiple, confidence-tagged results? (not WER)
  - prosodics; emphasis; speaking style
  - speaker tracking, identity, character

• Information from nonspeech
  - how to define objects
  - how to match symbolic search terms

• Integrating audio and video
  - combining information for search elements
  - forms of query

• Related applications
  - ‘structured content’ encoders (e.g. MPEG-4 SA)
  - semantic hearing aids; robot monitors
Outline
1. Sound content analysis
2. Speech recognition
3. Auditory scene analysis
4. Audio content indexing
5. Conclusions
Conclusions:
The state of sound content analysis
• Speech recognition:
  - focussed application, practical results
  - powerful statistical pattern recognition tools
  - able to exploit large training sets

• Computational auditory scene analysis:
  - real-world sounds are mixtures
  - discover advanced ecological constraints
  - results still rather preliminary

• Content-based retrieval:
  - compelling problem; forgiving application
  - leveraging audio-visual correlations
  - fertile ground for research