Lab ROSA Recognition & Organization

Recognition & Organization of Speech & Audio
Dan Ellis
http://labrosa.ee.columbia.edu/
Outline
1 Introducing LabROSA
2 Projects in speech, music & audio
3 Summary
ROSA overview - Dan Ellis
2001-09-28 - 1
Sound organization
[Figure: spectrogram of a sound mixture (frq/Hz 0-4000, time/s 0-12), with its analysis into labeled events: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Central operation:
- continuous sound mixture → distinct objects & events
• Perceptual impression is very strong
- but hard to ‘see’ in signal
Bregman’s lake
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
• Received waveform is a mixture
- two sensors, N signals ...
• Disentangling mixtures as primary goal
- perfect solution is not possible
- need knowledge-based constraints
The information in sound
[Figure: spectrograms of two recordings, ‘Steps 1’ and ‘Steps 2’ (freq / Hz 0-4000, time / s 0-4)]
• A sense of hearing is evolutionarily useful
- gives organisms ‘relevant’ information
• Auditory perception is ecologically grounded
- scene analysis is preconscious (→ illusions)
- special-purpose processing reflects ‘natural scene’ properties
- subjective not canonical (ambiguity)
Key themes for LabROSA
http://labrosa.ee.columbia.edu/
• Sound organization: construct hierarchy
- at an instant (sources)
- along time (segmentation)
• Scene analysis
- find attributes according to objects
- use attributes to form objects
- ... plus constraints of knowledge
• Exploiting large data sets (the ASR lesson)
- supervised/labeled: pattern recognition
- unsupervised: structure discovery, clustering
• Special cases:
- speech recognition
- other source-specific recognizers
• ... within a ‘complete explanation’
Outline
1 Introducing LabROSA
2 Projects in speech, music & audio
- Tandem speech recognition
- ‘Meeting recorder’ speech analysis
- Musical information extraction
- Alarm sound detection
3 Summary
Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
[Figure: block diagram. sound → Feature calculation → feature vectors → Acoustic classifier → phone probabilities → HMM decoder → phone / word sequence → Understanding/application... The acoustic classifier uses acoustic model parameters trained from DATA; the HMM decoder uses word models (e.g. s ah t) and a language model (e.g. p("sat"|"the","cat"), p("saw"|"the","cat")), also trained from DATA]
• ‘State of the art’ word-error rates (WERs):
- from 2% (dictation) to 30% (telephone conversations)
• Can use multiple streams...
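The word-error rates quoted above are conventionally computed as the word-level edit distance (substitutions + deletions + insertions) between the recognizer output and a reference transcript, divided by the number of reference words. A minimal sketch of that calculation follows; the function name and the example sentences are illustrative assumptions, not part of the original slides.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions align i reference words with an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
wer = word_error_rate("the cat sat", "the cat saw")
print(f"WER = {100 * wer:.1f}%")
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.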
Tandem speech recognition
(with Manuel Reyes, ICSI, OGI, CMU)
• Neural net estimates phone posteriors; but Gaussian mixtures model finer detail
• Combine them!
[Figure: three block diagrams.
Hybrid Connectionist-HMM ASR: Input sound → Feature calculation → Speech features → Neural net classifier → Phone probabilities → Noway decoder → Words
Conventional ASR (HTK): Input sound → Feature calculation → Speech features → Gauss mix models → Subword likelihoods → HTK decoder → Words
Tandem modeling: Input sound → Feature calculation → Speech features → Neural net classifier → Phone probabilities → Gauss mix models → Subword likelihoods → HTK decoder → Words]
• Train net, then train GMM on net output
- GMM is ignorant of net output ‘meaning’
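To make the "train GMM on net output" step concrete, the sketch below shows one way the neural net's per-frame phone posteriors could be conditioned into feature vectors for a Gaussian-mixture system. The log compression and PCA decorrelation are assumptions chosen here because they suit diagonal-covariance Gaussians; the slide itself does not specify the conditioning. The function name `tandem_features` and the random stand-in for the net output are hypothetical, and the GMM training itself is not shown.

```python
import numpy as np


def tandem_features(posteriors, eps=1e-10):
    """Map phone posteriors (frames x phones) to decorrelated feature vectors."""
    # Assumption: log compression spreads out the heavily skewed posteriors.
    logp = np.log(np.asarray(posteriors, dtype=float) + eps)
    centered = logp - logp.mean(axis=0)
    # PCA via eigen-decomposition of the feature covariance matrix.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest-variance directions first
    return centered @ eigvecs[:, order]


# Stand-in for real net output: a softmax over random activations.
rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 5))
posteriors = np.exp(activations) / np.exp(activations).sum(axis=1, keepdims=True)

feats = tandem_features(posteriors)
print(feats.shape)  # one feature vector per frame, ready for GMM training
```

The downstream GMM treats these vectors like any other acoustic features, which is the sense in which it is "ignorant" of what the net outputs mean.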
Tandem system results: Aurora digits
[Figure: WER as a function of SNR for various Aurora99 systems (HTK GMM baseline, Hybrid connectionist, Tandem, Tandem + PC); WER / % on a log scale from 1 to 100; SNR / dB from -5 to 20 plus clean, averaged over 4 noises]
Average WER ratio to baseline:
HTK GMM: 100%
Hybrid: 84.6%
Tandem: 64.5%
Tandem + PC: 47.2%
[Figure: Aurora 2 Eurospeech 2001 Evaluation; avg. rel. improvement (%), multicondition, axis from -10 to 60, by site: Columbia, Philips, UPC Barcelona, Bell Labs, IBM, Motorola 1, Motorola 2, Nijmegen, ICSI/OGI/Qualcomm, ATR/Griffith, AT&T, Alcatel, Siemens, UCLA, Microsoft, Slovenia, Granada]
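The average WER ratios listed on this slide can be restated as relative WER reductions against the HTK GMM baseline by subtracting each ratio from 100%. The short sketch below does only that arithmetic on the four figures given; the variable names are illustrative.

```python
# Average WER as a percentage of the HTK GMM baseline, as listed on the slide.
wer_ratio = {"HTK GMM": 100.0, "Hybrid": 84.6, "Tandem": 64.5, "Tandem + PC": 47.2}

# Relative WER reduction = 100% minus the ratio to baseline.
relative_reduction = {
    name: round(100.0 - ratio, 1) for name, ratio in wer_ratio.items()
}

for name, reduction in relative_reduction.items():
    print(f"{name}: {reduction:.1f}% relative WER reduction")
```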
The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, ...):
- 100 hours collected, ongoing transcription
- headsets + tabletop + ‘PDA’
Crosstalk cancellation
• Baseline speaker activity detection is hard:
[Figure: level/dB (0-40) vs. time / secs (120-155) for the channels of Spkr A to Spkr E and the tabletop mic, annotated with: speaker active, speaker B cedes floor, interruptions, breath noise, crosstalk]
• Noisy crosstalk model: m = C ⋅ s + n
• Estimate subband C_Aa from A’s peak energy
- ... including pure delay (10 ms frames)
- ... then linear inversion
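The crosstalk model m = C ⋅ s + n and the final "linear inversion" step can be illustrated with a toy two-speaker, two-microphone case. In the sketch below the coupling matrix `C`, the signal sizes and the noise level are made-up values for illustration only; estimating C from peak energies and handling the pure delay are not shown.

```python
import numpy as np

# Hypothetical coupling: each mic picks up its own speaker at gain 1.0
# plus a fraction of the other speaker (the crosstalk).
C = np.array([[1.0, 0.3],
              [0.2, 1.0]])

rng = np.random.default_rng(1)
s = rng.random((2, 100))                  # per-frame source energies (toy data)
n = 0.01 * rng.standard_normal((2, 100))  # small additive noise

m = C @ s + n                             # what the microphones record

# Linear inversion: solve C @ s_hat = m for each frame.
s_hat = np.linalg.solve(C, m)

residual = np.max(np.abs(s_hat - s))
print(f"largest error after inversion: {residual:.4f}")
```

With noise present the inversion cannot be exact: the recovered sources differ from the true ones by the noise passed through the inverse of C, so a poorly conditioned coupling matrix would amplify it.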
PDA-based speaker change detection
• Goal: small conference-tabletop device
• Speaker turns from PDA mock-up signals?
[Figure: pda.aif excerpt with 512-pt xcorr, 80% max thresh; spectrogram (0-8000 Hz, time 31-40 s) with transcribed turns for speakers Morgan, Adam and Dan, above the inter-channel cross-correlation lag (lag/ms, -1.5 to 1.5) vs. time/s]
• Speaker change detection (SCD) algorithm on spectral + interaural features
- average spectral + per-channel ITD, ∆φ
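An interaural time difference (ITD) of the kind mentioned above can be estimated from the peak of the cross-correlation between the two PDA channels. The sketch below is an illustration under stated assumptions: the function name `xcorr_lag`, the 16 kHz sample rate, the 512-sample frame and the synthetic delayed signal are all hypothetical, and the thresholding used on the slide is not reproduced.

```python
import numpy as np


def xcorr_lag(left, right, max_lag):
    """Lag in samples (positive: `right` is delayed relative to `left`)
    at which the cross-correlation of the two channels peaks."""
    full = np.correlate(right, left, mode="full")
    # np.correlate's 'full' output covers lags -(len(left)-1) .. len(right)-1.
    lags = np.arange(-(len(left) - 1), len(right))
    keep = np.abs(lags) <= max_lag
    return int(lags[keep][np.argmax(full[keep])])


fs = 16000                      # assumed sample rate in Hz
rng = np.random.default_rng(2)
left = rng.standard_normal(512)                       # one 512-sample frame
right = np.concatenate([np.zeros(5), left[:-5]])      # same frame, 5 samples later

lag = xcorr_lag(left, right, max_lag=24)
print(f"lag = {lag} samples = {1000 * lag / fs:.3f} ms")
```

A change in the sign or size of this lag from frame to frame suggests that the active talker has moved to a different side of the device, which is what makes it a candidate feature for speaker change detection.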
Music analysis: Lyrics extraction
(with Adam Berenzweig)
• Vocal content is highly salient, useful for retrieval
• Can we find the singing? Use an ASR classifier:
[Figure: spectrogram (freq / kHz 0-4) and posteriogram (phone class) over time / sec 0-2 for three examples: speech (trnset #58), music (no vocals #1), singing (vocals #17 + 10.5s)]
• Frame error rate ~20% for segmentation based on posterior-feature statistics
• Lyric segmentation + transcribed lyrics → training data for lyrics ASR...
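The slide does not say which posterior-feature statistics were used for the segmentation. As one plausible example, the per-frame entropy of the phone-class posteriors measures how confident the ASR classifier is: a frame dominated by one phone class has low entropy, while a frame the classifier finds confusing spreads its probability and has high entropy. The sketch below computes that statistic; the function name and the example posterior vectors are illustrative assumptions.

```python
import math


def frame_entropy(posteriors):
    """Entropy in bits of one frame's phone-class posterior distribution."""
    return -sum(p * math.log2(p) for p in posteriors if p > 0)


# Hypothetical four-class posteriors for two frames.
confident = [0.97, 0.01, 0.01, 0.01]   # classifier strongly prefers one class
confused = [0.25, 0.25, 0.25, 0.25]    # classifier has no preference

print(f"confident frame: {frame_entropy(confident):.3f} bits")
print(f"confused frame:  {frame_entropy(confused):.3f} bits")
```

Statistics like this, averaged over a window of frames, could then feed a simple classifier that labels each window as vocal or non-vocal.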
Music analysis: Structure recovery
(with Rob Turetsky)
• Structure recovery by similarity matrices (after Foote)
- similarity distance measure?
- segmentation & repetition structure
- interpretation at different scales: notes, phrases, movements
- incorporating musical knowledge: ‘theme similarity’
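A similarity matrix compares every frame of a piece with every other frame, so that repeated sections show up as bright off-diagonal blocks or stripes. The sketch below uses cosine similarity as the measure, which is only one possible answer to the "similarity distance measure?" question above; the function name and the toy three-section feature sequence are assumptions made for illustration.

```python
import numpy as np


def self_similarity(features):
    """Cosine similarity between every pair of frames (rows) of `features`."""
    X = np.asarray(features, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.maximum(norms, 1e-12)  # guard against all-zero frames
    return unit @ unit.T


# Toy 'song' with section structure A B A, four frames per section.
A = np.tile([1.0, 0.0, 0.2], (4, 1))
B = np.tile([0.0, 1.0, 0.2], (4, 1))
S = self_similarity(np.vstack([A, B, A]))

print(np.round(S[0], 2))  # similarity of the first frame to all twelve frames
```

In this toy case the first frame is highly similar to frames in both A sections and dissimilar to the B section, which is the repetition structure a segmentation step would then look for.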
Alarm sound detection
• Alarm sounds have particular structure
- people ‘know them when they hear them’
• Isolate alarms in sound mixtures
[Figure: three spectrogram panels (freq / Hz 0-5000, time / sec 1-2.5)]
- representation of energy in time-frequency
- formation of atomic elements
- grouping by common properties (onset &c.)
- classify by attributes...
• Key: recognize despite background
The ‘Machine listener’
• Goal: An auditory system for machines
- use same environmental information as people
• Aspects:
- recognize spoken commands (but not others)
- track ‘acoustic channel’ quality (for responses)
- categorize environment (conversation, crowd...)
• Scenarios
- personal listener → summary of your day
- autonomous robots: need awareness
Outline
1 Introducing LabROSA
2 Projects in speech, music & audio
3 Summary
LabROSA Summary
DOMAINS
• Meetings
• Personal recordings
• Location monitoring
• Broadcast
• Movies
• Lectures
ROSA
• Object-based structure discovery & learning
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Scene analysis
• Audio-visual integration
• Music analysis
APPLICATIONS
• Structuring
• Search
• Summarization
• Awareness
• Understanding