Recognition & Organization of Speech and Audio

Dan Ellis
Electrical Engineering, Columbia University
<dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/
Outline
1 Introducing LabROSA
2 Robust speech recognition
3 General audio analysis
4 Summary
ROSAtalk - Dan Ellis
2001-06-21 - 1
Organization of sound mixtures
[Figure: spectrogram of a sound mixture (frq/Hz, 0 to 4000, vs. time/s, 0 to 12) and its analysis into labeled elements: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
•
Core operation:
Converting continuous, scalar signal
into discrete, symbolic representation
Positioning sound organization
[Figure: "Sound organization" positioned among neighboring fields: Speech recognition, Image/video analysis, Audio processing, Comp. Aud. Scene Analysis, Psychoacoustics & physiology, Signal processing, Pattern recognition, Machine learning, Artificial intelligence]
•
Draws on many techniques
•
Abuts/overlaps various areas
About auditory perception
•
Received waveform is a mixture
- two sensors, N signals ...
- need knowledge-based constraints
•
Psychoacoustics:
the study of human sound organization
- ‘auditory scene analysis’ (Bregman’90)
•
Auditory perception is ecologically grounded
- scene analysis is preconscious (→ illusions)
- perceived organization:
real-world objects + events (transient)
- subjective not canonical (ambiguity)
Key themes for LabROSA
http://www.ee.columbia.edu/~dpwe/LabROSA/
•
Sound organization: construct hierarchy
- at an instant (sources)
- along time (segmentation)
•
Scene analysis
- find attributes according to objects
- use attributes to form objects
- ... plus constraints of knowledge
•
Exploiting large data sets (the ASR lesson)
- supervised/labeled: pattern recognition
- unsupervised: structure discovery, clustering
•
Special cases:
- speech recognition
- other source-specific recognizers
•
... within a ‘complete explanation’
Outline
1 Introducing LabROSA
2 Robust speech recognition
- ASR overview
- Tandem modeling
- Missing data and multisource decoding
3 General audio analysis
4 Summary
Automatic Speech Recognition (ASR)
•
Standard speech recognition structure:
[Figure: sound → Feature calculation → feature vectors → Acoustic classifier (acoustic model parameters, trained from DATA) → phone probabilities → HMM decoder (word models, e.g. "s ah t"; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone / word sequence → Understanding/application...]
•
‘State of the art’ word-error rates (WERs):
- from 2% (dictation) to 30% (telephone conversations)
•
Can use multiple streams...
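A minimal sketch of how a word-error rate such as the figures above is usually computed: the word-level edit distance (substitutions, deletions, insertions) between the reference and the recognized transcript, divided by the number of reference words. The function name and the example sentences are illustrative and do not come from the talk.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> saw) and one deletion (on) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat saw the mat"))
```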
Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)
•
Neural net estimates phone posteriors;
but Gaussian mixtures model finer detail
•
Combine them!
[Figure: three block diagrams]
Hybrid Connectionist-HMM ASR: Input sound → Feature calculation → Speech features → Neural net classifier → Phone probabilities → Noway decoder → Words
Conventional ASR (HTK): Input sound → Feature calculation → Speech features → Gauss mix models → Subword likelihoods → HTK decoder → Words
Tandem modeling: Input sound → Feature calculation → Speech features → Neural net classifier → Phone probabilities → Gauss mix models → Subword likelihoods → HTK decoder → Words
•
Train net, then train GMM on net output
- GMM is ignorant of net output ‘meaning’
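A rough sketch of the tandem idea under simplifying assumptions. The net's softmax outputs become features by taking logs, which recovers the pre-nonlinearity activations up to a per-frame offset. A Gaussian model is then fit to those features with no knowledge of which phone each dimension stands for. Here a single diagonal Gaussian stands in for the Gaussian mixture and the PCA step is left out; all function names and numbers are hypothetical.

```python
import math


def softmax(activations):
    """Stand-in for the net's output nonlinearity."""
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def tandem_features(posteriors, floor=1e-10):
    """Log posteriors: the pre-softmax activations minus a per-frame constant."""
    return [math.log(max(p, floor)) for p in posteriors]


def fit_diag_gaussian(frames):
    """Fit one diagonal Gaussian (a one-component stand-in for a GMM)."""
    n, dims = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(dims)]
    var = [max(sum((f[k] - mean[k]) ** 2 for f in frames) / n, 1e-3)
           for k in range(dims)]
    return mean, var


def log_likelihood(x, model):
    """Diagonal-Gaussian log-likelihood of one feature vector."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


# Made-up net activations for frames of two subword classes.
class_a = [[3.0, 0.1, -1.0], [2.5, 0.0, -0.8], [3.2, -0.2, -1.1]]
class_b = [[-1.0, 0.2, 2.8], [-0.7, 0.0, 3.1], [-1.2, 0.1, 2.6]]
model_a = fit_diag_gaussian([tandem_features(softmax(z)) for z in class_a])
model_b = fit_diag_gaussian([tandem_features(softmax(z)) for z in class_b])

test = tandem_features(softmax([2.9, 0.0, -0.9]))
print(log_likelihood(test, model_a) > log_likelihood(test, model_b))
```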
Tandem system results
•
It works very well (‘Aurora’ noisy digits):
[Figure: WER as a function of SNR for various Aurora99 systems. Axes: WER / % (log scale) vs. SNR / dB (averaged over 4 noises). Curves: HTK GMM baseline, Hybrid connectionist, Tandem, Tandem + PC]
Average WER ratio to baseline: HTK GMM 100%; Hybrid 84.6%; Tandem 64.5%; Tandem + PC 47.2%

System-features    Avg. WER 20-0 dB    Baseline WER ratio
HTK-mfcc           13.7%               100%
Neural net-mfcc    9.3%                84.5%
Tandem-mfcc        7.4%                64.5%
Tandem-msg+plp     6.4%                47.2%
Relative contributions
•
Approximate relative impact on baseline WER ratio
of different components:
[Figure: tandem system diagram (Input sound → plp and msg features → two neural net classifiers → pre-nonlinearity outputs summed → PCA orthog'n → Gauss mix models → HTK decoder → Words), annotated with the figures below]
- Combo over msg: +20%
- Combo over plp: +20%
- Combo over mfcc: +25%
- Pre-nonlinearity over posteriors: +12%
- KLT over direct: +8%
- Combo-into-HTK over Combo-into-noway: +15%
- NN over HTK: +15%
- Tandem over HTK: +35%
- Tandem over hybrid: +25%
- Tandem combo over HTK mfcc baseline: +53%
Inside Tandem systems:
What’s going on?
•
Visualizations of the net outputs
[Figure: the utterance “one eight three” (MFP_183A), shown Clean and at 5dB SNR to ‘Hall’ noise. Panels for each condition: Spectrogram (freq / kHz), Cepstral-smoothed mel spectrum (freq / mel chan), Phone posterior estimates (phone), Hidden layer linear outputs (phone); horizontal axis time / s, 0 to 1]
•
Neural net normalizes away noise
Tandem for large vocabulary recognition
•
Context-independent (CI) Tandem front end + context-dependent (CD) LVCSR back end
[Figure: Input sound → PLP feature calculation → Neural net classifier 1, and Input sound → MSG feature calculation → Neural net classifier 2; pre-nonlinearity outputs summed (+) → PCA decorrelation → Tandem feature calculation → SPHINX recognizer (GMM classifier → Subword likelihoods → HMM decoder, with MLLR adaptation) → Words]
•
Tandem benefits reduced:
[Figure: WER / % (axis 30 to 75) for mfc, plp, tandem1 and tandem2 features under three back-end conditions: Context Independent, Context Dependent, Context Dep. + MLLR]
‘Tandem-domain’ processing
(with Manuel Reyes)
•
Can we improve the ‘tandem’ features with
conventional processing (deltas,
normalization)?
[Figure: Neural net → Norm / deltas? → PCA decorrelation → Rank reduce? → Norm / deltas? → GMM classifier]
•
Somewhat...

Processing                               Avg. WER 20-0 dB    Baseline WER ratio
Tandem PLP mismatch baseline (24 els)    11.1%               70.3%
Rank reduce @ 18 els                     11.8%               77.1%
Delta → PCA                              9.7%                60.8%
PCA → Norm                               9.0%                58.8%
Delta → PCA → Norm                       8.3%                53.6%
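A small sketch of the two "conventional" steps named in the table, deltas and normalization. It assumes the common regression-window delta formula and per-utterance mean/variance normalization; the slide does not say which variants were actually used, and the PCA stage that sits between them in the best-scoring row is omitted. Function names are hypothetical.

```python
import math


def deltas(frames, width=2):
    """Regression-window deltas; edge frames are handled by repeating
    the first/last frame."""
    n_frames, dims = len(frames), len(frames[0])
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(dims):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, n_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def normalize(frames):
    """Shift each dimension to zero mean and scale to unit variance
    over the utterance."""
    n_frames, dims = len(frames), len(frames[0])
    out = [[0.0] * dims for _ in range(n_frames)]
    for k in range(dims):
        column = [f[k] for f in frames]
        mean = sum(column) / n_frames
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n_frames) or 1.0
        for t in range(n_frames):
            out[t][k] = (frames[t][k] - mean) / std
    return out


# A feature that rises by one unit per frame has delta 1.0 away from the edges.
ramp = [[float(t)] for t in range(6)]
print(deltas(ramp)[2], normalize(ramp)[0])
```

In a "Delta → PCA → Norm" chain, the deltas would be appended to the static features before decorrelation, and `normalize` would be applied to the decorrelated output.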
Missing data recognition
(with Cooke, Green, Barker @ Sheffield)
•
Noisy training seems to miss the point
- rather have single ‘clean’ models
•
Use missing feature theory...
- integrate over missing data dimensions $x_m$ (with $x_g$ the good, i.e. present, dimensions):
$$p(x \mid q) = \int p(x_m \mid x_g, q)\, p(x_g \mid q)\, dx_m$$
- trick is finding good/bad data mask
- soft classification improves
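A minimal sketch of both halves of the method, under stated assumptions. For state models that are diagonal Gaussians, integrating out the missing dimensions simply drops them from the likelihood product. The good/bad mask here comes from a local SNR estimate against a stationary noise estimate taken from the first few frames. The hard 0 dB threshold, the function names and the numbers are illustrative; the soft-mask variant mentioned above is not shown.

```python
import math


def snr_mask(frames, noise_frames=2, threshold_db=0.0):
    """Mark a time-frequency cell 'good' when its local SNR estimate
    exceeds the threshold. Noise energy per channel is the mean of the
    first few frames, which are assumed to be speech-free."""
    channels = len(frames[0])
    noise = [max(sum(f[c] for f in frames[:noise_frames]) / noise_frames, 1e-10)
             for c in range(channels)]
    mask = []
    for frame in frames:
        row = []
        for energy, n in zip(frame, noise):
            speech = max(energy - n, 1e-10)  # spectral-subtraction estimate
            row.append(10 * math.log10(speech / n) > threshold_db)
        mask.append(row)
    return mask


def masked_log_likelihood(x, good, mean, var):
    """Diagonal-Gaussian log p(x_g | q): dimensions flagged as missing
    are integrated out, which for a diagonal model means skipping them."""
    total = 0.0
    for xi, is_good, m, v in zip(x, good, mean, var):
        if is_good:
            total += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return total


# Two hypothetical clean-speech states, as (mean, variance) per channel.
states = {"one": ([5.0, 1.0, 1.0], [1.0, 1.0, 1.0]),
          "two": ([1.0, 1.0, 5.0], [1.0, 1.0, 1.0])}
observed = [5.0, 1.0, 6.0]   # state "one" with channel 2 swamped by noise
good = [True, True, False]   # mask flags channel 2 as unreliable

for use_mask in (False, True):
    flags = good if use_mask else [True] * 3
    best = max(states, key=lambda q: masked_log_likelihood(observed, flags, *states[q]))
    print("mask" if use_mask else "no mask", "->", best)
```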
"1754" + noise
[Figure: spectrograms of "1754" and "1754" + noise, with a mask based on a stationary noise estimate feeding Missing-data Recognition. Plot: AURORA 2000 - Test Set A, WER (0 to 90) vs. SNR (-5 to 20, and Clean) for MD Discrete SNR, MD Soft SNR, HTK clean training, HTK multi-condition]
Multi-source decoding
(Jon Barker @ Sheffield)
•
Search of sound-fragment interpretations
"1754" + noise
[Figure: noisy digits → Mask based on stationary noise estimate → Mask split into subbands → `Grouping' applied within bands (Spectro-Temporal Proximity, Common Onset/Offset) → Multisource Decoder → "1754"]
•
CASA for masks/fragments
- larger fragments → quicker search
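A toy illustration of why larger fragments mean a quicker search: with N fragments, each labeled either speech or background, there are 2^N candidate masks to consider. The exhaustive loop and the scoring function below are stand-ins only; the slides do not describe how the actual decoder organizes its search.

```python
from itertools import product


def best_labeling(fragments, score):
    """Try every speech/background labeling of the fragments and return
    (best score, labels). Each fragment is a set of time-frequency cells."""
    best = None
    for labels in product((False, True), repeat=len(fragments)):
        mask = set()
        for fragment, is_speech in zip(fragments, labels):
            if is_speech:
                mask |= fragment
        s = score(mask)
        if best is None or s > best[0]:
            best = (s, labels)
    return best


# Hypothetical fragments as sets of (time, channel) cells.
fragments = [{(0, 0), (0, 1)}, {(1, 0)}, {(2, 2)}]
# Stand-in score: agreement with a known speech region. A real system
# would score each candidate mask with the recognizer's likelihood.
speech_cells = {(0, 0), (0, 1), (2, 2)}
print(best_labeling(fragments, lambda mask: -len(mask ^ speech_cells)))
print(2 ** len(fragments), "labelings searched")
```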
Outline
1 Introducing LabROSA
2 Robust speech recognition
3 General audio analysis
- Alarm sound detection
- Computational Auditory Scene Analysis
- Music analysis
- The Meeting Recorder project
4 Summary
Alarm sound detection
•
Alarm sounds have particular structure
- people ‘know them when they hear them’
- build a generic detector?
•
Isolate alarms in sound mixtures
[Figure: three spectrogram panels, freq / Hz (0 to 5000) vs. time / sec (1 to 2.5)]
- representation of energy in time-frequency
- formation of atomic elements
- grouping by common properties (onset &c.)
- classify by attributes...
Computational Auditory Scene Analysis
(CASA)
•
Implement psychoacoustic theory? (Brown’92)
[Figure: input mixture → Front end → signal features (maps: freq, onset, time, period, frq.mod) → Object formation → discrete objects → Grouping rules → Source groups]
- what are the features? how are they used?
•
Additional ‘knowledge’ needed (Klassner’96)
[Figure: two time-frequency plots (TIME, 1.0 to 4.0 sec) showing labeled sources Buzzer-Alarm, Glass-Clink, Phone-Ring and Siren-Chirp, with components marked between 420 Hz and 3760 Hz]
Prediction-driven CASA
•
Data-driven (bottom-up) fails for noisy,
ambiguous sounds (most mixtures!)
•
Need top-down constraints:
[Figure: input mixture → Front end → signal features → Compare & reconcile → prediction errors → Hypothesis management → hypotheses (Noise components, Periodic components) → Predict & combine → predicted features → back to Compare & reconcile]
- fit vocabulary of generic elements to sound
... bottom of a hierarchy?
- account for entire scene
- driven by prediction failures
- pursue alternative hypotheses
PDCASA example
[Figure: ‘City’ sound example, f/Hz vs. time/s (0 to 9), level in dB. Extracted elements: Wefts1−4, Weft5, Wefts6,7, Weft8, Wefts9−12, Noise2,Click1, Noise1. Labels: Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10), Crash (10/10), Squeal (6/10), Truck (7/10)]
Music analysis: Structure recovery
(with Rob Turetsky)
•
Structure recovery by similarity matrices
(after Foote)
- similarity distance measure?
- segmentation & repetition structure
- interpretation at different scales:
notes, phrases, movements
- incorporating musical knowledge:
‘theme similarity’
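A minimal sketch of a frame-by-frame similarity matrix. Cosine similarity is used here as one common choice; the slide leaves the distance measure as an open question, so treat the measure, the function name and the toy features as assumptions. Repeated material shows up as high values away from the main diagonal.

```python
import math


def similarity_matrix(frames):
    """Cosine similarity between every pair of feature frames."""
    norms = [math.sqrt(sum(v * v for v in f)) or 1.0 for f in frames]
    return [[sum(a * b for a, b in zip(fi, fj)) / (ni * nj)
             for fj, nj in zip(frames, norms)]
            for fi, ni in zip(frames, norms)]


# Toy feature frames following an A B A pattern (made-up numbers).
frames = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1], [1.0, 0.0, 0.2]]
for row in similarity_matrix(frames):
    print(["%.2f" % v for v in row])
```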
Music analysis: Lyrics extraction
(with Adam Berenzweig)
•
Vocal content is highly salient,
useful for retrieval
•
Can we find the singing?
Use an ASR classifier:
[Figure: Spectrogram (freq / kHz, 0 to 4) and Posteriogram (phone class) vs. time / sec (0 to 2) for three examples: speech (trnset #58), music (no vocals #1), singing (vocals #17 + 10.5s)]
•
Frame error rate ~20% for segmentation based
on posterior-feature statistics
•
Lyric segmentation + transcribed lyrics
→ training data for lyrics ASR...
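The slide does not say which posterior-feature statistics were used. As one plausible example, assumed here purely for illustration, the average per-frame entropy of the phone posteriors separates confident (speech-like) segments from diffuse ones. Names and numbers are hypothetical.

```python
import math


def frame_entropy(posteriors):
    """Entropy (in nats) of one frame of phone posteriors."""
    return -sum(p * math.log(p) for p in posteriors if p > 0)


def mean_entropy(frames):
    """Segment-level statistic: average per-frame entropy."""
    return sum(frame_entropy(f) for f in frames) / len(frames)


# Made-up posteriograms: peaked when the classifier is confident of a
# phone, flat when nothing speech-like is present.
confident = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.96, 0.02, 0.01]]
diffuse = [[0.25, 0.25, 0.25, 0.25], [0.30, 0.20, 0.30, 0.20]]
print(mean_entropy(confident) < mean_entropy(diffuse))
```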
The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
•
Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech
•
Data collection (ICSI, UW, ...):
- 100 hours collected, ongoing transcription
- headsets + tabletop + ‘PDA’
Meeting recorder: Difficult data
[Figure: activity on the channels of Spkr A to Spkr E and the Table top microphone (level/dB, 0 to 40) over 120 to 155 secs, annotated: speaker active, speaker B cedes floor, interruptions, breath noise, crosstalk]
•
Cross-correlation gives speaker turns, motion
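A small sketch of the underlying operation: cross-correlating two microphone channels and taking the lag of the peak as the coupling delay between them. Tracking that lag over time is what would reveal who is speaking and when people move. The signals, the delay and the function name are made up for illustration.

```python
def best_lag(ref, other, max_lag):
    """Return (lag, value) of the cross-correlation peak: the delay in
    samples at which `other` best matches `ref`."""
    best, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = 0.0
        for n, r in enumerate(ref):
            m = n + lag
            if 0 <= m < len(other):
                val += r * other[m]
        if val > best_val:
            best, best_val = lag, val
    return best, best_val


# A close-talking channel and an attenuated copy arriving 4 samples later
# on a more distant microphone.
headset = [0.0] * 20
headset[3], headset[7], headset[8] = 1.0, -2.0, 1.5
tabletop = [0.0] * 4 + [0.5 * v for v in headset[:-4]]

lag, value = best_lag(headset, tabletop, max_lag=10)
print(lag)  # 4
```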
[Figure: meeting mr-2000-11-02-1440, 1020 to 1024.5 sec. Panels: Spkr C (freq / chan), Coupling C→A (delay / samples), Coupling C → Tabletop (delay / samples); annotation: participant movement]
Outline
1 Introducing LabROSA
2 Robust speech recognition
3 General audio analysis
4 Summary
- Some future project ideas
- LabROSA Summary
Future: Audio Information Retrieval
•
Searching in a database of audio
- speech .. use ASR
- text annotations .. search them
- sound effects library?
•
e.g. Muscle Fish “SoundFisher” browser
- define multiple ‘perceptual’ feature dimensions
- search by proximity in (weighted) feature space
[Figure: Sound segment database → Segment feature analysis → Feature vectors → Search/comparison → Results; Query example → Segment feature analysis → Search/comparison]
- features are ‘global’ for each soundfile,
no attempt to separate mixtures
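A minimal sketch of search by proximity in a weighted feature space, as described above: each sound is reduced to one global feature vector and results are ranked by weighted Euclidean distance to the query. The feature dimensions, weights and database entries are invented for the example and are not taken from SoundFisher.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance with a per-dimension weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def search(query, database, weights, k=3):
    """Names of the k database sounds closest to the query."""
    ranked = sorted(database.items(),
                    key=lambda item: weighted_distance(query, item[1], weights))
    return [name for name, _ in ranked[:k]]


# Hypothetical two-dimensional 'perceptual' features per sound file.
database = {"door slam": [0.9, 0.1],
            "bird song": [0.1, 0.9],
            "rain": [0.5, 0.5]}
print(search([0.8, 0.2], database, weights=[1.0, 1.0], k=2))
```

Setting a weight to zero removes that dimension from the comparison, which is one way a user could say "ignore this attribute" in a query.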
CASA for audio retrieval
•
When audio material contains mixtures,
global features are insufficient
•
Retrieval based on element/object analysis:
[Figure: Continuous audio archive → Generic element analysis → Element representations → Object formation → Objects + properties → Search/comparison → Results. Query example → Generic element analysis → (Object formation) → Search/comparison. Symbolic query (e.g. "rushing water") → Word-to-class mapping → Properties alone → Search/comparison]
- features are calculated over grouped subsets
Future: ‘Machine listener’
•
Goal: Unsupervised structure discovery
[Figure: Broadcast receiver → Audio → Acoustic event extraction → Segment features → Unsupervised clustering → Class definitions and Event templates, with Interactive refinement]
•
What can you do with a large unlabeled training
set (e.g. broadcast)?
- bootstrap learning: look for common patterns
- have to learn generalizations in parallel:
e.g. self-organizing maps, EM HMMs
- post-filtering by humans may find ‘meaning’ in
clusters
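A toy sketch of looking for common patterns in unlabeled segment features. Plain k-means is swapped in here as the simplest clustering stand-in; the slide itself names self-organizing maps and EM-trained HMMs, which are not shown. The data, the deterministic initialization and the function name are assumptions for the example.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means. Centers start at the first k points so the result
    is deterministic; an empty cluster keeps its previous center."""
    def nearest(p, centers):
        return min(range(len(centers)),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for c, group in enumerate(groups):
            if group:
                centers[c] = [sum(col) / len(group) for col in zip(*group)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Made-up segment features drawn from two kinds of acoustic event.
segments = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
            (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centers, labels = kmeans(segments, k=2)
print(labels)
```

The resulting cluster indices carry no meaning by themselves, which is where the human post-filtering mentioned above would come in.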
Audio-video-text content analysis
(with Shih-Fu Chang, Kathleen McKeown)
[Figure: Audio/Video material → Scene alignment and segmentation → A/V object-based feature extraction → Clustering and structure discovery → Multimedia Lexicon; Text sources → Term extraction → Multimedia Lexicon; Interactive refinement. Uses: Question answering, Browsing, Search, Summary]
•
Audio and video provide complementary info
- correlate object features to define templates?
•
Associated text annotations provide a very
small amount of labeling
- .. but for a very large number of examples
– sufficient to obtain purchase?
- build a ‘multimedia lexicon’ for question answering
Summary:
Applications for sound organization
What do people do with their ears?
•
Human-computer interface
- .. includes knowing when (& why) you’ve failed
•
Robots
- intelligence requires perceptual awareness
- Sony’s AIBO: dog-hearing
•
Archive indexing & retrieval
- pure audio archives
- true multimedia content analysis
•
Content ‘understanding’
- intelligent classification & summarization
•
Autonomous monitoring
•
‘Structure discovery’ algorithms
LabROSA Summary
DOMAINS
• Meetings
• Personal recordings
• Location monitoring
• Broadcast
• Movies
• Lectures
ROSA
• Object-based structure discovery & learning
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Scene analysis
• Audio-visual integration
• Music analysis
APPLICATIONS
• Structuring
• Search
• Summarization
• Awareness
• Understanding