Recognition & Organization of Speech and Audio

advertisement
Recognition & Organization
of Speech and Audio
Dan Ellis
Electrical Engineering, Columbia University
<dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/
Outline
1
Introducing LabROSA
2
Tandem modeling for robust ASR
3
Other current projects
4
Future projects
5
Summary & conclusions
ROSAtalk - Dan Ellis
2001-01-11 - 1
Lab
ROSA
1
Organization of sound mixtures
frq/Hz
4000
2000
0
0
2
4
6
8
10
12
time/s
Analysis
Voice (evil)
Voice (pleasant)
Stab
Rumble
•
ROSAtalk - Dan Ellis
Choir
Strings
Core operation:
Converting continuous, scalar signal
into discrete, symbolic representation
2001-01-11 - 2
Lab
ROSA
Positioning sound organization
Speech
recognition
Image/video
analysis
Sound
organization
Audio
processing
Comp. Aud.
Scene Analysis
Psychoacoustics
& physiology
Signal
processing
Pattern
recognition
Machine
learning
Artificial
intelligence
•
Draws on many techniques
•
Abuts/overlaps various areas
ROSAtalk - Dan Ellis
2001-01-11 - 3
Lab
ROSA
About auditory perception
•
Received waveform is a mixture
- two sensors, N signals ...
- need knowledge-based constraints
•
Psychoacoustics:
the study of human sound organization
- ‘auditory scene analysis’ (Bregman’90)
•
Auditory perception is ecologically grounded
- scene analysis is preconscious (→ illusions)
- perceived organization:
real-world objects + events (transient)
- subjective not canonical (ambiguity)
ROSAtalk - Dan Ellis
2001-01-11 - 4
Lab
ROSA
Key themes for LabROSA
http://www.ee.columbia.edu/~dpwe/LabROSA/
•
Sound organization: construct hierarchy
- at an instant (sources)
- along time (segmentation)
•
Scene analysis
- find attributes according to objects
- use attributes to form objects
- ... plus constraints of knowledge
•
Exploiting large data sets (the ASR lesson)
- supervised/labeled: pattern recognition
- unsupervised: structure discovery, clustering
•
Special cases:
- speech recognition
- other source-specific recognizers
•
... within a ‘complete explanation’
ROSAtalk - Dan Ellis
2001-01-11 - 5
Lab
ROSA
Outline
1
Introducing LabROSA
2
Tandem modeling for robust ASR
- ASR overview
- Tandem modeling
- Investigating the benefits
3
Other current projects
4
Future projects
5
Summary & conclusions
ROSAtalk - Dan Ellis
2001-01-11 - 6
Lab
ROSA
Automatic Speech Recognition (ASR)
•
Standard speech recognition structure:
sound
Feature
calculation
D AT A
feature vectors
Acoustic model
parameters
Word models
s
ah
t
Language model
p("sat"|"the","cat")
p("saw"|"the","cat")
Acoustic
classifier
phone probabilities
HMM
decoder
phone / word
sequence
Understanding/
application...
•
‘State of the art’ word-error rates (WERs):
- 2% (dictation) - 30% (telephone conversations)
•
Can use multiple streams...
ROSAtalk - Dan Ellis
2001-01-11 - 7
Lab
ROSA
Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU)
•
Neural net estimates phone posteriors;
but Gaussian mixtures model finer detail
•
Combine them!
Hybrid Connectionist-HMM ASR
Feature
calculation
Conventional ASR (HTK)
Noway
decoder
Neural net
classifier
Feature
calculation
HTK
decoder
Gauss mix
models
h#
pcl
bcl
tcl
dcl
C0
C1
C2
s
Ck
tn
ah
t
s
ah
t
tn+w
Input
sound
Words
Phone
probabilities
Speech
features
Input
sound
Subword
likelihoods
Speech
features
Tandem modeling
Feature
calculation
Neural net
classifier
HTK
decoder
Gauss mix
models
h#
pcl
bcl
tcl
dcl
C0
C1
C2
s
Ck
tn
ah
t
tn+w
Input
sound
•
ROSAtalk - Dan Ellis
Speech
features
Phone
probabilities
Subword
likelihoods
Words
Train net, then train GMM on net output
- GMM is ignorant of net output ‘meaning’
2001-01-11 - 8
Lab
ROSA
Words
Tandem system results
•
It works very well (‘Aurora’ noisy digits):
WER as a function of SNR for various Aurora99 systems
100
WER / % (log scale)
50
20
Average WER ratio to baseline:
HTK GMM:
100%
Hybrid:
84.6%
Tandem:
64.5%
Tandem + PC: 47.2%
10
5
HTK GMM baseline
Hybrid connectionist
Tandem
Tandem + PC
2
1
clean
-20
System-features
-15
-10
-5
SNR / dB (averaged over 4 noises)
0
5
Avg. WER 20-0 dB
Baseline WER ratio
HTK-mfcc
13.7%
100%
Neural net-mfcc
9.3%
84.5%
Tandem-mfcc
7.4%
64.5%
Tandem-msg+plp
6.4%
47.2%
ROSAtalk - Dan Ellis
2001-01-11 - 9
Lab
ROSA
Relative contributions
•
Approx relative impact on baseline WER ratio
for different component:
Combo over msg:
+20%
plp
Neural net
classifier
h#
pcl
bcl
tcl
dcl
C0
C1
C2
Ck
tn
tn+w
Input
sound
Pre-nonlinearity over +
posteriors: +12%
PCA
orthog'n
HTK
decoder
Gauss mix
models
s
msg
Neural net
classifier
h#
pcl
bcl
tcl
dcl
C0
C1
C2
Ck
KLT over
direct:
+8%
ah
Combo-into-HTK over
Combo-into-noway:
+15%
t
Words
tn
Combo over plp:
+20%
tn+w
Combo over mfcc:
+25%
NN over HTK:
+15%
Tandem over HTK:
+35%
Tandem over hybrid:
+25%
Tandem combo over HTK mfcc baseline: +53%
ROSAtalk - Dan Ellis
2001-01-11 - 10
Lab
ROSA
Inside Tandem systems:
What’s going on?
•
Visualizations of the net outputs
Clean
3
3
2
2
1
1
0
0
10
10
5
5
0
0
q
h#
ax
uw
ow
ao
ah
ay
ey
eh
ih
iy
w
r
n
v
th
f
z
s
kcl
tcl
k
t
q
h#
ax
uw
ow
ao
ah
ay
ey
eh
ih
iy
w
r
n
v
th
f
z
s
kcl
tcl
k
t
q
h#
ax
uw
ow
ao
ah
ay
ey
eh
ih
iy
w
r
n
v
th
f
z
s
kcl
tcl
k
t
q
h#
ax
uw
ow
ao
ah
ay
ey
eh
ih
iy
w
r
n
v
th
f
z
s
kcl
tcl
k
t
freq / kHz
4
phone
Spectrogram
5dB SNR to ‘Hall’ noise
4
phone
“one eight three”
(MFP_183A)
10
0
-10
-20
-30
-40 dB
Cepstral-smoothed
mel spectrum
Phone posterior
estimates
Hidden layer
linear outputs
freq / mel chan
7
ROSAtalk - Dan Ellis
5
4
0
•
6
0.2
0.4
0.6
time / s
0.8
1
3
1
0.5
0
20
10
0
-10
0
0.2
0.4
0.6
time / s
0.8
1
-20
Neural net normalizes away noise
2001-01-11 - 11
Lab
ROSA
Tandem feature space ‘magnification’
•
Neural net performs a nonlinear remapping of
the feature space
Smoothed spectrum
12
10
8
6
4
2
20
Linear outputs for /r/, /iy/, /w/
10
0
Posteriors for /r/, /iy/, /w/
1
0
0.7
0.75
0.8
time / s
- small changes across critical boundaries
result in large output changes
ROSAtalk - Dan Ellis
2001-01-11 - 12
Lab
ROSA
Outline
1
Introducing LabROSA
2
Tandem modeling for robust ASR
3
Other current projects
- Alarm sound detection
- Computational Auditory Scene Analysis
- Multi-source and missing-data recognition
- The Meeting Recorder project
4
Future projects
5
Summary & conclusions
ROSAtalk - Dan Ellis
2001-01-11 - 13
Lab
ROSA
freq / Hz
Alarm sound detection
•
Alarm sounds have particular structure
- people ‘know them when they hear them’
- build a generic detector?
•
Isolate alarms in sound mixtures
5000
5000
5000
4000
4000
4000
3000
3000
3000
2000
2000
2000
1000
1000
1000
0
1
1.5
ROSAtalk - Dan Ellis
2
2.5
0
1
1.5
2
time / sec
2.5
0
1
1
1.5
2
2.5
representation of energy in time-frequency
formation of atomic elements
grouping by common properties (onset &c.)
classify by attributes...
2001-01-11 - 14
Lab
ROSA
Computational Auditory Scene Analysis
(CASA)
•
input
mixture
Implement psychoacoustic theory? (Brown’92)
Front end
signal
features
(maps)
Object
formation
discrete
objects
Source
groups
Grouping
rules
freq
onset
time
period
frq.mod
- what are the features? how are they used?
•
Additional ‘knowledge’ needed (Klassner’96)
3760 Hz
Buzzer-Alarm
2540 Hz
2230 Hz
2350 Hz
Glass-Clink
1675 Hz
1475 Hz
950 Hz
500 Hz
460 Hz
420 Hz
Phone-Ring
1.0
Siren-Chirp
2.0
TIME
ROSAtalk - Dan Ellis
3.0
4.0 sec
1.0
2.0
3.0
4.0 sec
TIME
2001-01-11 - 15
Lab
ROSA
Prediction-driven CASA
•
Data-driven (bottom-up) fails for noisy,
ambiguous sounds (most mixtures!)
•
Need top-down constraints:
hypotheses
Noise
components
Hypothesis
management
prediction
errors
input
mixture
Front end
signal
features
Compare
& reconcile
Periodic
components
Predict
& combine
predicted
features
- fit vocabulary of generic elements to sound
... bottom of a hierarchy?
- account for entire scene
- driven by prediction failures
- pursue alternative hypotheses
ROSAtalk - Dan Ellis
2001-01-11 - 16
Lab
ROSA
PDCASA example
f/Hz
City
4000
2000
1000
400
200
1000
400
200
100
50
0
f/Hz
1
2
3
Wefts1−4
4
5
Weft5
6
7
Wefts6,7
8
Weft8
9
Wefts9−12
4000
2000
1000
400
200
1000
400
200
100
50
Horn1 (10/10)
Horn2 (5/10)
Horn3 (5/10)
Horn4 (8/10)
Horn5 (10/10)
f/Hz
Noise2,Click1
4000
2000
1000
400
200
Crash (10/10)
f/Hz
Noise1
4000
2000
1000
−40
400
200
−50
−60
Squeal (6/10)
Truck (7/10)
−70
0
ROSAtalk - Dan Ellis
1
2
3
4
5
6
7
8
9
time/s
dB
2001-01-11 - 17
Lab
ROSA
Missing data recognition & CASA
(with Barker, Cooke, Green/Sheffield)
•
Missing-data recognition
- integrate across ‘don’t-know’ values
- ‘perfect’ mask → excellent performance in noise
•
Multi-source decoder
- Viterbi search of sound-fragment interpretations
"1754" + noise
Mask based on
stationary noise estimate
Mask split into subbands
`Grouping' applied within bands:
Spectro-Temporal Proximity
Common Onset/Offset
Multisource
Decoder
•
ROSAtalk - Dan Ellis
"1754"
CASA for masks/fragments
- larger fragments → quicker search
2001-01-11 - 18
Lab
ROSA
Meeting recorder
(with ICSI, UW, SRI, IBM)
•
Microphones in conventional meetings
- for transcription/summarization/retrieval
- informal, overlapped speech
•
Data collection (ICSI and ...):
- 10s of hours collected, ongoing
- now being transcribed
ROSAtalk - Dan Ellis
2001-01-11 - 19
Lab
ROSA
Meeting recorder: Research issues
•
Preliminary analysis
- transcription & forced alignment
- ground truth in turns/overlaps
- preliminary distant-mic recordings
•
Research areas
- meeting dialog: overlaps, turns etc.
- language modeling for meetings
- feature design for distant acoustics
•
Applications
- information retrieval from meetings
- ‘mapping’ meeting content
- sociological analysis of meeting behavior
ROSAtalk - Dan Ellis
2001-01-11 - 20
Lab
ROSA
Outline
1
Introducing LabROSA
2
Tandem modeling for robust ASR
4
Other current projects
4
Future projects
- Audio Content-Based Retrieval
- A ‘machine listener’
- Audio-video-text content analysis
5
Summary & conclusions
ROSAtalk - Dan Ellis
2001-01-11 - 21
Lab
ROSA
Audio Information Retrieval
•
Searching in a database of audio
- speech .. use ASR
- text annotations .. search them
- sound effects library?
•
e.g. Muscle Fish “SoundFisher” browser
- define multiple ‘perceptual’ feature dimensions
- search by proximity in (weighted) feature space
Segment
feature
analysis
Sound segment
database
Feature vectors
Seach/
comparison
Results
Segment
feature
analysis
Query example
- features are ‘global’ for each soundfile,
no attempt to separate mixtures
ROSAtalk - Dan Ellis
2001-01-11 - 22
Lab
ROSA
CASA for audio retrieval
•
When audio material contains mixtures,
global features are insufficient
•
Retrieval based on element/object analysis:
Generic
element
analysis
Continuous audio
archive
Object
formation
Objects + properties
Element representations
Generic
element
analysis
Seach/
comparison
Results
(Object
formation)
Query example
rushing water
Symbolic query
Word-to-class
mapping
Properties alone
- features are calculated over grouped subsets
ROSAtalk - Dan Ellis
2001-01-11 - 23
Lab
ROSA
A ‘machine listener’
•
Goal: Unsupervised structure discovery
Broadcast
receiver
Audio
Acoustic event
extraction
Segment
features
Unsupervised
clustering
Class
definitions
Event
templates
Interactive
refinement
•
ROSAtalk - Dan Ellis
What can you do with a large unlabeled training
set (e.g. broadcast)?
- bootstrap learning: look for common patterns
- have to learn generalizations in parallel:
e.g. self-organizing maps, EM HMMs
- post-filtering by humans may find ‘meaning’ in
clusters
2001-01-11 - 24
Lab
ROSA
Audio-video-text content analysis
(with Shih-Fu Chang, Kathleen McKeown)
Audio/Video material
Scene alignment and
segmentation
A/V object-based
feature extraction
Clustering and
structure discovery
• Question
answering
• Browsing
• Search
• Summary
Multimedia
Lexicon
Term
extraction
Text sources
Interactive
refinement
•
Audio and video provide complementary info
- correlate object features to define templates?
•
Associated text annotations provide a very
small amount of labeling
- .. but for a very large number of examples
– sufficient to obtain purchase?
- build a ‘multimedia lexicon’ for questionanswering
ROSAtalk - Dan Ellis
2001-01-11 - 25
Lab
ROSA
Outline
1
Introducing LabROSA
2
Tandem modeling for robust ASR
4
Other current projects
4
Future projects
5
Summary & conclusions
ROSAtalk - Dan Ellis
2001-01-11 - 26
Lab
ROSA
Applications for sound organization
What do people do with their ears?
•
Human-computer interface
- .. includes knowing when (& why) you’ve failed
•
Robots
- intelligence requires perceptual awareness
- Sony’s AIBO: dog-hearing
•
Archive indexing & retrieval
- pure audio archives
- true multimedia content analysis
•
Content ‘understanding’
- intelligent classification & summarization
•
Autonomous monitoring
•
Broader ‘structure discovery’ algorithms
ROSAtalk - Dan Ellis
2001-01-11 - 27
Lab
ROSA
DOMAINS
Summary
• Meetings
• Personal recordings
• Location monitoring
• Broadcast
• Movies
• Lectures
ROSA
• Object-based structure discovery & learning
APPLICATIONS
• Speech recognition
• Speech characterization
• Nonspeech recognition
ROSAtalk - Dan Ellis
•
•
•
•
•
• Scene analysis
• Audio-visual integration
• Music analysis
Structuring
Search
Summarization
Awareness
Understanding
2001-01-11 - 28
Lab
ROSA
Conclusions
•
New classification schemes for ASR
- ... combining multiple approaches/sources
•
But sound is more than just speech!
- speech is a special case
- need to deal with the ‘other stuff’
•
Object-based analysis
- it’s what people do
- the world presents acoustic mixtures
•
Whole-scene representation
- it’s what people do
- provides mutual constraints of overlap
•
Broad range of approaches
for a broad range of phenomena
ROSAtalk - Dan Ellis
2001-01-11 - 29
Lab
ROSA
Download