Recognition & Organization of Speech and Audio

Dan Ellis
Electrical Engineering, Columbia University
<dpwe@ee.columbia.edu>
Outline
1. Sound ‘organization’
2. Background & related work
3. Existing projects
4. Future projects
5. Summary & conclusions
Organization of sound mixtures
[Figure: spectrogram (frq/Hz vs. time/s) of a 12-second sound mixture, analyzed into sources: voice (evil), voice (pleasant), stab, rumble, choir, strings]
• Core operation: converting a continuous, scalar signal into a discrete, symbolic representation
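As an illustration of that first step, here is a minimal sketch of turning a raw waveform into the time-frequency representation that later symbolic analysis works on; the file name and analysis parameters are placeholders, not those used for the figure above.

```python
# Minimal sketch: continuous, scalar signal -> time-frequency representation.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

rate, x = wavfile.read("mixture.wav")        # hypothetical input file
if x.ndim > 1:
    x = x.mean(axis=1)                       # mix down to mono

# Short-time Fourier transform: energy as a function of time and frequency
f, t, X = stft(x, fs=rate, nperseg=1024, noverlap=768)
log_mag = 20 * np.log10(np.abs(X) + 1e-10)   # dB-magnitude "spectrogram"

# Element formation, grouping and labelling all operate on discrete
# structures derived from arrays like log_mag.
print(log_mag.shape)
```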
Positioning sound organization
[Diagram: sound organization at the centre of neighbouring fields: speech recognition, image/video analysis, audio processing, psychoacoustics & physiology, signal processing, pattern recognition, machine learning, artificial intelligence]
• Draws on many techniques
• Abuts/overlaps various areas
About auditory perception
• Received waveform is a mixture
  - two sensors, N signals ...
  - need knowledge-based constraints
• Psychoacoustics: the study of human sound organization
  - ‘auditory scene analysis’ (Bregman’90)
• Auditory perception is ecologically grounded
  - scene analysis is preconscious (→ illusions)
  - perceived organization: real-world objects + events (transient)
  - subjective, not canonical (ambiguity)
Key themes for LabROSA
• Sound organization
  - recovering/constructing an abstraction hierarchy
  - at an instant (sources)
  - along time (segmentation)
• Scene analysis
  - need to find attributes according to objects
  - use attributes to form objects
  - ... plus constraints of knowledge
• Exploiting large data sets (the ASR lesson)
  - supervised/labelled: pattern recognition
  - unsupervised: structure discovery, clustering
• Special-purpose cases:
  - speech recognition
  - source-specific recognizers
• ... within a ‘complete explanation’
Applications for sound organization
What do people do with their ears?
• Robots
  - intelligence requires awareness
  - Sony’s AIBO: dog-hearing
• Human-computer interface
  - ... includes knowing when (& why) you’ve failed
• Archive indexing & retrieval
  - pure audio archives
  - true multimedia content analysis
• Content ‘understanding’
  - intelligent classification & summarization
• Autonomous monitoring
• Broader ‘structure discovery’ algorithms
Outline
1. Sound ‘organization’
2. Background & related work
   - Audio coding & compression
   - Automatic Speech Recognition
   - Computational Auditory Scene Analysis
   - Multimedia information retrieval
3. Existing projects
4. Future projects
5. Summary & conclusions
Audio coding & compression
• Goal is reconstruction, not abstraction
• But criteria are ‘subjective’: want the same percept, not the same waveform
• MPEG-Audio:
  - filterbanks
  - information-theoretic coding
  - psychoacoustic masking of quantization noise
• MPEG-4 ‘Structured Audio’
  - computer-music synthesis model
  - instrument definition + control stream
  - automatic analysis?
Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
[Diagram: sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters, trained on data) → phone probabilities → HMM decoder (word models, e.g. "s ah t"; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → understanding/application...]
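A minimal sketch of what the ‘HMM decoder’ box does: Viterbi search for the best state sequence given the per-frame phone probabilities. Toy-sized, with no lexicon or language model, so it only illustrates the principle, not the recognizer used here.

```python
# Viterbi decoding over per-frame phone probabilities (illustrative only).
import numpy as np

def viterbi(obs_prob, trans, init):
    """obs_prob: (T, S) acoustic-classifier outputs; trans: (S, S) state
    transition matrix; init: (S,) initial state priors."""
    T, S = obs_prob.shape
    logp, logA = np.log(obs_prob + 1e-12), np.log(trans + 1e-12)
    delta = np.log(init + 1e-12) + logp[0]     # best score ending in each state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA         # score via each predecessor
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logp[t]
    path = [int(delta.argmax())]               # backtrace the best sequence
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                          # state (phone) index per frame
```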
• ‘State of the art’ word-error rates (WERs): 2% (dictation) to 30% (telephone conversations)
• Segmentation of speech & nonspeech
  - ... the recognizer wouldn’t notice!
Spoken document retrieval
• Text-based IR on ASR transcripts
  - e.g. news broadcasts (CMU’s Informedia, Thisl)
• Recognition errors are not the limiting factor
  - TREC-98 results: average precision drops only from 0.5 to 0.4
• Weak at the word level, but OK over paragraphs
  - replay the audio, don’t show the text!
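For concreteness, a toy sketch of text-based retrieval over (errorful) transcripts: TF-IDF scoring, where accumulating evidence over many words is what makes individual recognition errors tolerable. Names and data are made up.

```python
# Toy TF-IDF retrieval over ASR transcripts (illustrative only).
from collections import Counter
import math

def build_index(docs):
    """docs: dict id -> transcript string; returns per-document tf-idf weights."""
    tfs = {d: Counter(text.lower().split()) for d, text in docs.items()}
    df = Counter(w for tf in tfs.values() for w in tf)
    n = len(docs)
    return {d: {w: c * math.log(1 + n / df[w]) for w, c in tf.items()}
            for d, tf in tfs.items()}

def search(index, query):
    terms = query.lower().split()
    scores = {d: sum(vec.get(w, 0.0) for w in terms) for d, vec in index.items()}
    return sorted(scores, key=scores.get, reverse=True)

index = build_index({"clip1": "the president spoke about the budget",
                     "clip2": "rain is forecast for the weekend"})
print(search(index, "president budget"))   # -> ['clip1', 'clip2']
```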
Computational Auditory Scene Analysis
(CASA)
• Implement psychoacoustic theory? (Brown’92)
[Diagram: input mixture → front end → signal features (maps: frequency, onset time, period, frequency modulation) → object formation (grouping rules) → discrete objects → source groups]
  - what are the features? how are they used?
• Top-down constraints are needed (Klassner’96)
[Figure: time-frequency tracks (0-4 sec) of the components of four overlapping events: Buzzer-Alarm, Glass-Clink, Phone-Ring, Siren-Chirp]
Audio Information Retrieval
• Searching in a database of audio
  - speech .. use ASR
  - text annotations .. search them
  - sound effects library?
• e.g. Muscle Fish “SoundFisher” browser
  - define multiple ‘perceptual’ feature dimensions
  - search by proximity in (weighted) feature space
[Diagram: sound segment database → segment feature analysis → feature vectors; query example → segment feature analysis → search/comparison → results]
  - features are ‘global’ for each soundfile; no attempt to separate mixtures
  - segmentation...
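A minimal sketch of the query-by-example idea (my own illustration, not Muscle Fish code): each file is summarized by one global feature vector, and retrieval is nearest-neighbour search under a weighted Euclidean distance.

```python
# Query-by-example in a weighted global-feature space (illustrative only).
import numpy as np

def weighted_distance(a, b, w):
    return float(np.sqrt(np.sum(w * (a - b) ** 2)))

def query_by_example(query_vec, database, weights, k=5):
    """database: dict name -> global feature vector (e.g. means/variances of
    loudness, pitch, brightness); weights: per-dimension importance."""
    dists = {name: weighted_distance(query_vec, vec, weights)
             for name, vec in database.items()}
    return sorted(dists, key=dists.get)[:k]
```

Because the features are global per file, such a scheme cannot distinguish the pieces of a mixture, which is exactly the limitation revisited in the later ‘CASA for audio retrieval’ slide.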
Music analysis
• Automatic transcription (score recovery)
  - classic ‘hard problem’: can people even do it?
  - recent success in reduced forms, e.g. melody, drum track (Goto’00)
• Instrument identification
  - ideas from speaker identification (basic pattern recognition)
  - plus instrument family hierarchies (Martin’99)
• Fingerprinting
  - spot recordings despite noise and distortion
  - relies on perceptual invariants
• Music clustering
  - e.g. music recommendation based on the signal
  - correlate objective features with user ratings?
Multimedia description
• MPEG-7 ‘Metadata’
  - MPEG is known for audio/video compression standards
  - but it also develops standards for search and indexing
• MPEG-7 is a standard format for metadata: well-defined categories for content description
• Focus is on framework & infrastructure
• Audio descriptor categories:
  - from ASR
  - from the computer music community
  - uses still to emerge
Outline
1. Sound ‘organization’
2. Background & related work
3. Existing projects
   - Acoustic change detection
   - Robust speech recognition
   - Nonspeech event detection
   - Prediction-driven CASA
4. Future projects
5. Summary & conclusions
Acoustic change detection
(with Williams/Sheffield, Ferreiros/UPMadrid)
• Approaches:
  - ‘metric’: find instants of maximal change
  - ‘model-based’: best alignment of a model set
  - ‘Bayesian’: generate models when warranted
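One standard instance of the ‘metric’/‘model-based’ family (not necessarily the variant used in this work) scores each candidate boundary with the Bayesian Information Criterion, comparing one Gaussian over the window against two Gaussians split at the boundary:

```python
# BIC change-point score over a window of feature frames (illustrative only).
import numpy as np

def bic_change_score(X, t, lam=1.0):
    """X: (N, d) feature frames; t: candidate change point (2 <= t <= N-2).
    Positive score = 'two segments' explains the data better than one."""
    N, d = X.shape
    logdet = lambda Z: np.linalg.slogdet(np.cov(Z, rowvar=False)
                                         + 1e-6 * np.eye(d))[1]
    gain = 0.5 * (N * logdet(X) - t * logdet(X[:t]) - (N - t) * logdet(X[t:]))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(N)
    return gain - penalty

# Scanning t across the window and keeping maxima above zero gives candidate
# acoustic-change instants.
```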
[Figure: spectrogram, speech/music/speech+music posteriors, and a scatter plot of entropy vs. dynamism for 2.5-second segments of speech, music, and speech+music]
• Typically agnostic about the underlying problem
  - use any features, find any changes
• Good for ASR adaptation, otherwise...
Speech feature combination
(with Bilmes/UW, Hermansky/OGI, ICSI)
• ‘Multistream’ approaches
[Diagram: input sound → feature 1 / feature 2 calculation → separate acoustic classifiers → phone probabilities → posterior combination → HMM decoder → word hypotheses]
  - streams can correct each other → big gains
• Which feature streams to combine?
  - low mutual information between classifiers indicates complementary streams (see the sketch below)
[Figure: conditional mutual information (CMI, in bits) between the feature streams PLPa, PLPb, MSGa, MSGb]
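As a simplified illustration of that selection criterion (plain mutual information between two classifiers’ frame-level decisions; the figure above uses the conditional variant), a histogram-based estimate might look like:

```python
# Mutual information (bits) between two frame-level label streams.
import numpy as np

def mutual_information(labels_a, labels_b):
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1                      # co-occurrence counts
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)     # marginals
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])))

# Low values between two streams' classifier outputs suggest the streams
# are complementary and worth combining.
```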
Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU)
• Neural net estimates phone posteriors, but Gaussian mixtures model finer detail
• Combine them!
[Diagram: three system structures. Hybrid connectionist-HMM ASR: input sound → feature calculation → neural net classifier → phone probabilities → Noway decoder → words. Conventional ASR (HTK): input sound → feature calculation → Gaussian mixture models → subword likelihoods → HTK decoder → words. Tandem modeling: input sound → feature calculation → neural net classifier → phone probabilities → Gaussian mixture models → subword likelihoods → HTK decoder → words]
• 50% relative improvement over GMMs alone
  - different statistical modeling schemes get different information from the same training data
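The core of the tandem transformation can be sketched in a few lines: log-compress the net’s posteriors and decorrelate them with PCA so that diagonal-covariance Gaussian mixtures can model them; dimensions and smoothing constants below are placeholders.

```python
# Turning neural-net phone posteriors into features for a GMM-HMM back end.
import numpy as np

def tandem_features(posteriors, n_keep=24):
    """posteriors: (T, P) per-frame phone posteriors from the network."""
    logp = np.log(posteriors + 1e-8)          # compress the skewed [0,1] range
    logp -= logp.mean(axis=0)                 # remove per-dimension means
    vals, vecs = np.linalg.eigh(np.cov(logp, rowvar=False))
    order = np.argsort(vals)[::-1][:n_keep]   # leading principal components
    return logp @ vecs[:, order]              # decorrelated 'tandem' features

# The returned matrix is then treated like any other acoustic feature stream
# by the Gaussian-mixture models.
```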
Alarm sound detection
• Deconstructing sound mixtures
[Figure: three spectrogram panels (freq/Hz vs. time/sec) of an alarm sound in a mixture, at successive stages of analysis]
  - representation of energy in time-frequency
  - formation of atomic elements
  - grouping by common properties (onset &c.; see the sketch below)
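A toy sketch of that last step, grouping already-extracted elements by common onset (the element extraction itself is assumed to have happened; the 50 ms tolerance is a placeholder):

```python
# Group atomic time-frequency elements whose onsets roughly coincide.
def group_by_common_onset(elements, window=0.05):
    """elements: list of dicts with an 'onset' time in seconds.
    Returns lists of elements that start within `window` of each other."""
    groups = []
    for el in sorted(elements, key=lambda e: e["onset"]):
        if groups and el["onset"] - groups[-1][0]["onset"] <= window:
            groups[-1].append(el)     # co-onsetting: same candidate object
        else:
            groups.append([el])       # new onset: start a new candidate object
    return groups
```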
• Alarm sounds have particular structure
  - people ‘know them when they hear them’
  - build a generic detector?
Prediction-driven CASA
• Data-driven (bottom-up) processing fails for noisy, ambiguous sounds (most mixtures!)
[Diagram: data-driven chain: input mixture → front end → signal features (maps) → object formation (grouping rules) → discrete objects → source groups]
• Need top-down constraints:
[Diagram: prediction-driven loop: input mixture → front end → signal features → compare & reconcile; prediction errors → hypothesis management → hypotheses (noise components, periodic components) → predict & combine → predicted features, fed back to the comparison]
  - fit a vocabulary of generic elements to the sound ... the bottom of a hierarchy?
  - account for the entire scene
  - driven by prediction failures
  - pursue alternative hypotheses
PDCASA example
[Figure: PDCASA analysis of a 10-second city-street ambience: the original spectrogram plus the extracted objects (Wefts 1-12, Noise1, Noise2 + Click1), with listeners’ identifications and rates: Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10), Crash (10/10), Squeal (6/10), Truck (7/10); levels in dB]
Outline
1. Sound ‘organization’
2. Background & related work
3. Existing projects
4. Future projects (ordered from somewhat concrete to provisional)
   - Meeting Recorder
   - Missing-data recognition & CASA for ASR
   - Structure from audio-video features
   - Speech & speaker recognition
   - Music organization
   - Audio archive structure discovery
5. Summary & conclusions
Meeting recorder
(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings
  - for transcription/summarization/retrieval
  - informal, overlapped speech
• Data collection (ICSI and ...)
• Research: ASR, nonspeech, organization
  - unprecedented data, new applications
Missing data recognition & CASA
(with Barker, Cooke, Green/Sheffield)
• Missing-data recognition
  - integrate across ‘don’t-know’ values
  - a ‘perfect’ mask → excellent performance in noise
• Multi-source decoder
  - Viterbi search over sound-fragment interpretations
[Diagram: "1754" + noise → mask based on a stationary-noise estimate → mask split into subbands → ‘grouping’ applied within bands (spectro-temporal proximity, common onset/offset) → multisource decoder → "1754"]
• CASA for masks/fragments
  - larger fragments → quicker search
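The heart of missing-data recognition can be sketched for diagonal-covariance Gaussian mixtures: dimensions marked unreliable are simply marginalized out of the observation likelihood (this shows the marginalization case only; bounded integration over the masked values is a common refinement).

```python
# Frame log-likelihood under a diagonal GMM, using only 'reliable' dimensions.
import numpy as np

def masked_log_likelihood(x, reliable, means, variances, weights):
    """x: (d,) feature frame; reliable: (d,) bool mask;
    means, variances: (M, d); weights: (M,) mixture weights."""
    xr, mr, vr = x[reliable], means[:, reliable], variances[:, reliable]
    comp = -0.5 * np.sum(np.log(2 * np.pi * vr) + (xr - mr) ** 2 / vr, axis=1)
    m = comp.max()
    return float(np.log(np.sum(weights * np.exp(comp - m))) + m)
```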
Structure from audio-video features
(Peng Xu)
• HMM modeling of sports video
[Diagram: HMM states: inter, replay, start, middle, end]
• Distribution of camera-motion labels and color features per state
  - also need within-state sequential structure
• Add features from audio
  - could be orthogonal/complementary
• Audio feature toolkit?
  - simple feature vectors, boundaries, classes
  - wealth of potential applications!
Speech & speaker recognition
• Words are not enough: confidence-tagged alternate word hypotheses
• Other useful information:
  - speaker change detection
  - speaker characterization
  - phrasing & timing
  - prosodic cues to dialog state
  - laughter, pauses, etc.
• Integration with other analyses
  - segmentation for adaptation
  - nonspeech events to ignore
  - video-derived information...
Music organization
• Music is a special case
  - lots of structure
  - highly significant
• The trick is to find meaningful, tractable questions
  - where is the boundary between speech and music?
• New (counter-intuitive) approaches?
  - perceive as a whole, not voice-by-voice (Scheirer’00) → global features for chord structures
  - generic ‘event’ cues + local feature classification
  - a more provisional notion of instruments/voices
CASA for audio retrieval
• The Muscle Fish system uses global features:
[Diagram: sound segment database → segment feature analysis → feature vectors; query example → segment feature analysis → search/comparison → results]
• Mixtures → need elements & objects:
[Diagram: continuous audio archive → generic element analysis → element representations → object formation → objects + properties; query example → generic element analysis → (object formation); symbolic query, e.g. "rushing water" → word-to-class mapping → properties alone; all feeding search/comparison → results]
  - features calculated on grouped subsets
Audio archive structure discovery
• What can you do with a large unlabeled training set (e.g. multimedia clips from the web)?
  - bootstrap learning: look for common patterns
  - have to learn generalizations in parallel, e.g. self-organizing maps, EM-trained HMMs
  - post-filtering by humans may find ‘meaning’ in clusters
• Associated text annotations provide a very small amount of labeling
  - ... but for a very large number of examples; sufficient to obtain purchase?
  - maximize label utility through NLP-type operations (expansion, disambiguation, etc.)
  - goal is automatic term-to-feature mapping for term-based content queries
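As a bare-bones illustration of the bootstrap step (not a proposal for the actual learner), clip-level feature vectors can be clustered with k-means; the resulting clusters are what human post-filtering or sparse text labels would then try to name.

```python
# k-means over clip-level feature vectors from an unlabeled audio archive.
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """X: (n_clips, d) feature vectors; returns cluster index per clip."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each clip to its nearest center, then re-estimate centers
        assign = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign, centers
```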
Outline
1. Sound ‘organization’
2. Background & related work
3. Existing projects
4. Future projects
5. Summary & conclusions
Summary
[Diagram: LabROSA bridging domains and applications]
DOMAINS: meetings, personal recordings, location monitoring, broadcast, movies, lectures
ROSA (object-based structure discovery & learning): speech recognition, speech characterization, nonspeech recognition, scene analysis, audio-visual integration, music analysis
APPLICATIONS: structuring, search, summarization, awareness, understanding
Conclusions
• Sound is more than just speech!
  - speech is a special case
  - most auditory perceivers don’t understand speech
• Object-based analysis is critical
  - it’s what people do
  - the world presents acoustic mixtures
• Whole-scene representation is the way forward
  - it’s what people do
  - provides mutual constraints of overlap
• A broad range of approaches for a broad range of phenomena