Recognition & Organization of Speech & Audio

Dan Ellis
Electrical Engineering, Columbia University
<dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/
Outline
1. Introducing LabROSA
2. Speech recognition & processing
3. Auditory Scene Analysis
4. Projects & applications
5. Summary
Sound organization
[Figure: spectrogram of a 12-second sound mixture (0-4000 Hz) analyzed into distinct events: voice (evil), voice (pleasant), stab, rumble, choir, strings]
• Central operation:
  - continuous sound mixture → distinct objects & events
• Perceptual impression is very strong
  - but hard to 'see' in the signal
Bregman’s lake
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
• Received waveform is a mixture
  - two sensors, N signals ...
• Disentangling mixtures is the primary goal
  - a perfect solution is not possible
  - need knowledge-based constraints
The information in sound
[Figure: spectrograms (0-4000 Hz, 0-4 s) of two footstep recordings, 'Steps 1' and 'Steps 2']
• A sense of hearing is evolutionarily useful
  - gives organisms 'relevant' information
• Auditory perception is ecologically grounded
  - scene analysis is preconscious (→ illusions)
  - special-purpose processing reflects 'natural scene' properties
  - subjective, not canonical (ambiguity)
Key themes for LabROSA
http://labrosa.ee.columbia.edu/
• Sound organization: construct hierarchy
  - at an instant (sources)
  - along time (segmentation)
• Scene analysis
  - find attributes according to objects
  - use attributes to form objects
  - ... plus constraints of knowledge
• Exploiting large data sets (the ASR lesson)
  - supervised/labeled: pattern recognition
  - unsupervised: structure discovery, clustering
• Special cases:
  - speech recognition
  - other source-specific recognizers
• ... within a 'complete explanation'
Outline
1. Introducing LabROSA
2. Speech recognition & processing
   - Connectionist and tandem recognition
   - Speech and speaker detection
   - Musical information extraction
3. Auditory Scene Analysis
4. Projects & applications
5. Summary
Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
  [Diagram: sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters trained from data) → phone probabilities → HMM decoder (word models, e.g. s-ah-t; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → understanding/application...]
• 'State of the art' word-error rates (WERs):
  - 2% (dictation) to 30% (telephone conversations)
• Can use multiple streams...
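As an illustration of the HMM decoder stage above, here is a minimal Viterbi sketch in Python/NumPy. It is not the system in the diagram; the state set, transition matrix and observation scores are placeholders supplied by the caller.

    import numpy as np

    def viterbi_decode(log_obs, log_trans, log_init):
        """Most likely state (e.g. phone) sequence through an HMM.
        log_obs:   (T, S) per-frame log observation scores for each state
        log_trans: (S, S) log transition probabilities
        log_init:  (S,)   log initial-state probabilities
        """
        T, S = log_obs.shape
        delta = log_init + log_obs[0]            # best score ending in each state
        back = np.zeros((T, S), dtype=int)       # backpointers
        for t in range(1, T):
            scores = delta[:, None] + log_trans  # (previous state, next state)
            back[t] = np.argmax(scores, axis=0)
            delta = scores[back[t], np.arange(S)] + log_obs[t]
        path = [int(np.argmax(delta))]           # trace back the best path
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]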
The connectionist-HMM hybrid
(Morgan & Bourlard, 1995)
• Conventional recognizers use P(Xi|Si), an acoustic likelihood model
  - model the distribution with, e.g., Gaussian mixtures
• Can replace it with the posterior, P(Si|Xi):
  a neural-network acoustic classifier
[Diagram: sound → feature calculation (frames tn ... tn+w) → neural network → posterior probabilities for context-independent phone classes (h#, pcl, bcl, tcl, dcl, ...) → HMM decoder (word models, language model) → words]
  - the neural network estimates the phone given the acoustics
  - discriminative
• Simpler structure for research
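In such hybrids the network posteriors are usually turned into the scaled likelihoods the HMM decoder expects by dividing by the class priors (Bayes' rule up to the constant p(X)). A minimal sketch, with illustrative array shapes:

    import numpy as np

    def posteriors_to_scaled_likelihoods(posteriors, priors, floor=1e-8):
        """Convert per-frame phone posteriors P(S|X) into scaled likelihoods
        P(X|S)/P(X) = P(S|X)/P(S) for HMM decoding.
        posteriors: (T, S) network outputs
        priors:     (S,)   phone priors, e.g. from training alignments
        """
        return (np.log(np.maximum(posteriors, floor))
                - np.log(np.maximum(priors, floor)))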
Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)
• A neural net estimates phone posteriors,
  but Gaussian mixtures model finer detail
• Combine them!
[Diagram: three systems compared.
 Hybrid connectionist-HMM ASR: input sound → feature calculation → speech features → neural net classifier → phone probabilities → Noway decoder (word models, e.g. s-ah-t; language model) → words.
 Conventional ASR (HTK): input sound → feature calculation → speech features → Gaussian mixture models → subword likelihoods → HTK decoder → words.
 Tandem modeling: input sound → feature calculation → speech features → neural net classifier → phone probabilities → Gaussian mixture models → subword likelihoods → HTK decoder → words.]
• Train net, then train GMM on net output
  - GMM is ignorant of net output 'meaning'
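A sketch of the tandem feature path, assuming the net's pre-softmax (linear) outputs are available: decorrelate them with PCA/KLT and hand them to the conventional GMM-HMM as if they were ordinary features. Dimensions are illustrative.

    import numpy as np

    def tandem_features(linear_outputs, n_keep=24):
        """PCA/KLT orthogonalization of MLP pre-nonlinearity outputs,
        producing decorrelated features for a GMM-HMM (HTK) back end.
        linear_outputs: (T, C) pre-softmax activations for C phone classes
        """
        X = linear_outputs - linear_outputs.mean(axis=0)
        eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
        order = np.argsort(eigvals)[::-1][:n_keep]   # keep the top components
        return X @ eigvecs[:, order]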
Tandem system results
• It works very well ('Aurora' noisy digits):
  [Figure: WER as a function of SNR for various Aurora99 systems — WER (%, log scale) vs. SNR (dB, averaged over 4 noises, -5 dB to clean) for the HTK GMM baseline, hybrid connectionist, Tandem, and Tandem + PC systems]
  - Average WER ratio to baseline: HTK GMM 100%, Hybrid 84.6%, Tandem 64.5%, Tandem + PC 47.2%

  System-features     Avg. WER (20-0 dB)   Baseline WER ratio
  HTK-mfcc            13.7%                100%
  Neural net-mfcc      9.3%                84.5%
  Tandem-mfcc          7.4%                64.5%
  Tandem-msg+plp       6.4%                47.2%
Connectionist speaker recognition
(with Dominique Genoud)
• Use neural networks to model speakers rather than phones?
• Specialize a phone classifier for a particular speaker?
• Do both at once with a 'Twin-output MLP':
[Diagram: twin-output MLP. Input: the current feature vector plus context, t0-4 ... t0+4 — 9 frames × 12 features = 108 inputs; 2000 hidden units; 107 outputs = 53 'speaker' phones + 53 'world' phones + non-speech (MLP 108:2000:107), giving posterior probability vectors]
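A shape-only sketch of such a network (random weights, no training loop); the ordering of the 107 output classes is an assumption:

    import numpy as np

    rng = np.random.default_rng(0)

    # 9 frames x 12 features = 108 inputs, 2000 hidden units,
    # 107 outputs (53 'speaker' phones + 53 'world' phones + non-speech)
    N_IN, N_HID, N_OUT = 108, 2000, 107
    W1 = rng.standard_normal((N_IN, N_HID)) * 0.01
    b1 = np.zeros(N_HID)
    W2 = rng.standard_normal((N_HID, N_OUT)) * 0.01
    b2 = np.zeros(N_OUT)

    def twin_output_mlp(x):
        """Forward pass returning softmax posteriors over the combined
        speaker-phone / world-phone / non-speech classes."""
        h = np.tanh(x @ W1 + b1)
        z = h @ W2 + b2
        z -= z.max(axis=-1, keepdims=True)            # numerical stability
        p = np.exp(z)
        return p / p.sum(axis=-1, keepdims=True)

    posteriors = twin_output_mlp(rng.standard_normal((5, N_IN)))   # 5 frames
    speaker_score = posteriors[:, :53].sum(axis=1)   # assumed: speaker phones first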
Speech/music discrimination
(with Gethin Williams)
• Neural net is very sensitive to speech:
  - characteristic jumping between phones
  - define statistics to distinguish speech regions,
    e.g. entropy, 'dynamism' (delta-magnitude)
  [Figure: spectrogram and phone posteriors for a speech / music / speech+music excerpt; scatter plot of dynamism vs. entropy for 2.5-second segments of speech, music, and speech+music]
• 1.4% classification error on 2.5 s segments
  - use HMM structure for segmentation
• Good predictor of ASR success
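A minimal sketch of the two statistics over a segment of phone posteriors; the exact 'dynamism' definition used here (mean squared frame-to-frame change) is an assumption about the delta-magnitude measure:

    import numpy as np

    def entropy_and_dynamism(posteriors, eps=1e-12):
        """Per-segment statistics over a (T, C) array of phone posteriors.
        Entropy is averaged over frames; 'dynamism' is the mean squared
        difference between consecutive posterior vectors."""
        p = np.clip(posteriors, eps, 1.0)
        entropy = float(np.mean(-np.sum(p * np.log(p), axis=1)))
        dynamism = float(np.mean(np.sum(np.diff(posteriors, axis=0) ** 2, axis=1)))
        return entropy, dynamism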
Music analysis: Lyrics extraction
(with Adam Berenzweig)
• Vocal content is highly salient, useful for retrieval
• Can we find the singing? Use an ASR classifier:
  [Figure: spectrograms and phone-class posteriograms (0-4 kHz, 0-2 s) for speech (trnset #58), music with no vocals (#1), and singing (vocals #17 + 10.5 s)]
• Frame error rate ~20% for segmentation based on posterior-feature statistics
• Lyric segmentation + transcribed lyrics
  → training data for lyrics ASR...
Music analysis: Structure recovery
(with Rob Turetsky)
• Structure recovery by similarity matrices (after Foote)
  - similarity distance measure?
  - segmentation & repetition structure
  - interpretation at different scales: notes, phrases, movements
  - incorporating musical knowledge: 'theme similarity'
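A minimal similarity-matrix sketch; cosine similarity over per-frame features (e.g. MFCCs or chroma) is just one candidate distance measure:

    import numpy as np

    def similarity_matrix(features):
        """Frame-vs-frame cosine similarity (after Foote) from a (T, D)
        array of per-frame features; repeated sections appear as
        off-diagonal stripes."""
        norms = np.linalg.norm(features, axis=1, keepdims=True)
        unit = features / np.maximum(norms, 1e-12)
        return unit @ unit.T          # (T, T)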
Outline
1. Introducing LabROSA
2. Speech recognition & processing
3. Auditory Scene Analysis
   - Perception of sound mixtures
   - Illusions
   - Computational modeling
4. Projects & applications
5. Summary
Sound mixtures
• Sound 'scene' is almost always a mixture
  - always stuff going on
  - sound is 'transparent'
  [Figure: spectrograms (0-4000 Hz, ~1 s): Speech + Noise = Speech+Noise]
• Need information related to our 'world model'
  - i.e. separate objects
  - a wolf howling in a blizzard is the same as a wolf howling in a rainstorm
  - whole-signal statistics won't do this
• 'Separateness' is similar to independence
  - objects/sounds that change in isolation
  - but: depends on the situation, e.g. passing car vs. mechanic's diagnosis
Human Sound Organization
• "Auditory Scene Analysis" [Bregman 1990]
  - break the mixture into small elements (in time-frequency)
  - elements are grouped into sources using cues
  - sources have aggregate attributes
• Grouping 'rules' (Darwin, Carlyon, ...):
  - cues: common onset/offset/modulation, harmonicity, spatial location, ...
  [Figure: grouping-cue illustration, from Darwin 1996]
Cues and grouping
• Common attributes and 'fate'
  [Figure: schematic time-frequency plot of harmonics sharing a common onset]
  - harmonicity, common onset
    → perceived as a single sound source/event
• But: can have conflicting cues
  [Figure: schematic of one harmonic shifted by ∆f in frequency and ∆t in onset time]
  - determine how ∆t and ∆f affect
    • segregation of the harmonic
    • pitch of the complex
The effect of context
• Context can create an 'expectation':
  i.e. a bias towards a particular interpretation
• e.g. Bregman's "old-plus-new" principle:
  a change in a signal will be interpreted as an added source whenever possible
  [Figure: spectrograms (0-2 kHz, 0-1.2 s) of the same added energy in two contexts]
  - a different division of the same energy depending on what preceded it
Computational ASA
• Goal: Systems to 'pick out' sounds in a mixture
  - ... like people do
• Implement psychoacoustic theory?
  [Diagram: input mixture → front end (maps of frequency, time, onset, period, frequency modulation) → signal features → object formation → discrete objects → grouping rules → source groups]
  - 'bottom-up', using common onset & periodicity
• Able to extract voiced speech:
  [Figure: spectrograms (100-3000 Hz, 0-1 s) of the input mixture brn1h.aif and the extracted voiced speech brn1h.fi.aif]
Adding top-down cues
Perception is not direct, but a search for plausible hypotheses

• Data-driven (bottom-up)...
  [Diagram: input mixture → front end → signal features → object formation → discrete objects → grouping rules → source groups]
  vs. prediction-driven (top-down) (PDCASA)
  [Diagram: input mixture → front end → signal features → compare & reconcile → prediction errors → hypothesis management → hypotheses (noise components, periodic components) → predict & combine → predicted features, fed back to compare & reconcile]
• Motivations
  - detect non-tonal events (noise & click elements)
  - support 'restoration illusions'...
  → hooks for high-level knowledge
  + 'complete explanation', multiple hypotheses, ...
PDCASA and complex scenes
[Figure: PDCASA analysis of a 10-second 'City' street-ambience recording (50-4000 Hz). The input is decomposed into periodic 'weft' elements (Wefts 1-12), noise elements (Noise1, Noise2) and a click element (Click1), corresponding to the events Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10), Crash (10/10), Squeal (6/10) and Truck (7/10); energy scale -40 to -70 dB]
Outline
1. Introducing LabROSA
2. Speech recognition & processing
3. Auditory Scene Analysis
4. Projects & applications
   - Missing data recognition
   - Hearing prostheses
   - The machine listener
5. Summary
Missing data recognition
(Cooke, Green, Barker... @ Sheffield)
• Energy overlaps in time-frequency hide features
  - some observations are effectively missing
• Use missing-feature theory...
  - integrate over the missing data dimensions x_m:
    p(x|q) = ∫ p(x_g | x_m, q) p(x_m | q) dx_m
• Effective in speech recognition
  - the trick is finding the good/bad data mask
  [Figure: spectrograms of "1754" + noise, the mask based on a stationary noise estimate, and the recognized string "1754"; plot of WER vs. SNR (-5 dB to clean) on Aurora 2000 Test Set A for missing-data recognition with discrete and soft SNR masks vs. HTK clean-training and multi-condition baselines]
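For diagonal-covariance Gaussians with unbounded missing dimensions, the integral over x_m is 1 and the marginal reduces to a product over the reliable dimensions. A sketch of that simplest case (the 'soft SNR' variant in the plot additionally uses energy bounds, not shown here):

    import numpy as np

    def missing_data_loglik(x, mask, mean, var):
        """Log-likelihood of one frame under a diagonal Gaussian,
        integrating out the unreliable ('missing') dimensions.
        x, mean, var: (D,) feature frame and Gaussian parameters
        mask:         (D,) boolean, True where the feature is reliable
        """
        d = x[mask] - mean[mask]
        return float(-0.5 * np.sum(d * d / var[mask]
                                   + np.log(2 * np.pi * var[mask])))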
Multi-source decoding
(Jon Barker @ Sheffield)
• Search of sound-fragment interpretations
  [Diagram: "1754" + noise → mask based on a stationary noise estimate → mask split into subbands → 'grouping' applied within bands (spectro-temporal proximity, common onset/offset) → multisource decoder → "1754"]
• CASA for masks/fragments
  - larger fragments → quicker search
• Use with nonspeech models?
Audio Information Retrieval
• Searching in a database of audio
  - speech .. use ASR
  - text annotations .. search them
  - sound effects library?
• e.g. Muscle Fish "SoundFisher" browser
  - define multiple 'perceptual' feature dimensions
  - search by proximity in (weighted) feature space
  [Diagram: sound segment database → segment feature analysis → feature vectors → search/comparison → results; query example → segment feature analysis → search/comparison]
  - features are 'global' for each soundfile,
    no attempt to separate mixtures
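A sketch of proximity search in a weighted feature space; the actual SoundFisher features and metric are not specified here, so the names and weights are illustrative:

    import numpy as np

    def retrieve(query_vec, database_vecs, weights, k=5):
        """Rank archived segments by weighted Euclidean distance to the
        query's 'global' feature vector.
        query_vec:     (D,)   features of the query example
        database_vecs: (N, D) features of each archived segment
        weights:       (D,)   per-dimension weights
        """
        diff = (database_vecs - query_vec) * np.sqrt(weights)
        dists = np.linalg.norm(diff, axis=1)
        order = np.argsort(dists)[:k]
        return order, dists[order]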
CASA for audio retrieval
• When audio material contains mixtures, global features are insufficient
• Retrieval based on element/object analysis:
  [Diagram: continuous audio archive → generic element analysis → element representations → object formation → objects + properties → search/comparison → results; query example (e.g. 'rushing water') → generic element analysis → (object formation); symbolic query → word-to-class mapping → properties alone]
  - features are calculated over grouped subsets
Alarm sound detection
• Alarm sounds have particular structure
  - people 'know them when they hear them'
• Isolate alarms in sound mixtures
  [Figure: three spectrograms (0-5000 Hz, 1-2.5 s) of alarm sounds against different backgrounds]
  - representation of energy in time-frequency
  - formation of atomic elements
  - grouping by common properties (onset &c.)
  - classify by attributes...
• Key: recognize despite background
Future prosthetic listening devices
• CASA to replace lost hearing ability
  - sound mixtures are difficult for the hearing impaired
• Signal enhancement
  - resynthesize a single source without the background
  - (need very good resynthesis)
• Signal understanding
  - monitor for particular sounds (doorbell, knocks)
    & translate into an alternative mode (vibro alarm)
  - real-time textual descriptions,
    i.e. "automatic subtitles for real life"
  [Cartoon: '[thunder]' captioned 'S: I THINK THE WEATHER'S CHANGING']
The ‘Machine listener’
• Goal: An auditory system for machines
  - use the same environmental information as people
• Aspects:
  - recognize spoken commands (but not others)
  - track 'acoustic channel' quality (for responses)
  - categorize the environment (conversation, crowd...)
• Scenarios
  - personal listener → summary of your day
  - autonomous robots: need awareness
Outline
1. Introducing LabROSA
2. Tandem modeling: Neural net features
3. Meeting recorder data analysis
4. Computational Auditory Scene Analysis
5. Summary
Summary: Applications for sound organization
What do people do with their ears?
• Human-computer interface
  - .. includes knowing when (& why) you've failed
• Robots
  - intelligence requires perceptual awareness
  - Sony's AIBO: dog-hearing
• Archive indexing & retrieval
  - pure audio archives
  - true multimedia content analysis
• Content 'understanding'
  - intelligent classification & summarization
• Autonomous monitoring
• 'Structure discovery' algorithms
LabROSA Summary
[Diagram:
 DOMAINS: meetings, personal recordings, location monitoring, broadcast, movies, lectures
 ROSA (object-based structure discovery & learning): speech recognition, speech characterization, nonspeech recognition, scene analysis, audio-visual integration, music analysis
 APPLICATIONS: structuring, search, summarization, awareness, understanding]
Extra slides...
Positioning sound organization
[Diagram: 'Sound organization' positioned among neighboring areas: speech recognition, image/video analysis, audio processing, computational auditory scene analysis, psychoacoustics & physiology, signal processing, pattern recognition, machine learning, artificial intelligence]
• Draws on many techniques
• Abuts/overlaps various areas
Tandem recognition: Relative contributions
• Approximate relative impact on baseline WER ratio for the different components:
  [Diagram: tandem system with plp and msg feature streams, each through a neural net classifier, combined, PCA orthogonalization, Gaussian mixture models, HTK decoder → words]
  - Combo over msg: +20%
  - Combo over plp: +20%
  - Combo over mfcc: +25%
  - Pre-nonlinearity over posteriors: +12%
  - KLT over direct: +8%
  - Combo-into-HTK over Combo-into-noway: +15%
  - NN over HTK: +15%
  - Tandem over HTK: +35%
  - Tandem over hybrid: +25%
  - Tandem combo over HTK mfcc baseline: +53%
Inside Tandem systems: What's going on?
• Visualizations of the net outputs
  [Figure: for the utterance "one eight three" (MFP_183A), clean vs. 5 dB SNR 'Hall' noise: spectrogram, cepstrally-smoothed mel spectrum, phone posterior estimates, and hidden-layer linear outputs]
• Neural net normalizes away noise
Acoustic Change Detection (ACD)
(with Javier Ferreiros, UPM)
• Find optimal segmentation points via the Bayesian Information Criterion (BIC)
• Cluster segments to find underlying 'sources'
• Repeat segmentation incorporating cluster assignments
  [Diagram: speech → feature extraction 1 / feature extraction 2 → broad-class recognizer → hypothesis generator → ACD → clustering]
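A sketch of the standard BIC change test (one Gaussian vs. two Gaussians split at a candidate frame); the penalty weight λ and minimum segment length are illustrative, and this is not the specific multi-pass system above:

    import numpy as np

    def delta_bic(X, t, lam=1.0):
        """BIC comparison of 'one full-covariance Gaussian' vs. 'two
        Gaussians split at frame t' for a window X of shape (N, D);
        positive values favour a change point at t."""
        N, D = X.shape
        def logdet_cov(Y):
            cov = np.cov(Y, rowvar=False) + 1e-6 * np.eye(D)
            return np.linalg.slogdet(cov)[1]
        penalty = 0.5 * lam * (D + 0.5 * D * (D + 1)) * np.log(N)
        return (0.5 * N * logdet_cov(X)
                - 0.5 * t * logdet_cov(X[:t])
                - 0.5 * (N - t) * logdet_cov(X[t:])
                - penalty)

    def best_change_point(X, min_seg=20):
        """Scan candidate split frames and return the best, if any."""
        scores = {t: delta_bic(X, t) for t in range(min_seg, len(X) - min_seg)}
        if not scores:
            return None, None
        t_best = max(scores, key=scores.get)
        return (t_best, scores[t_best]) if scores[t_best] > 0 else (None, None)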
The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings
  - for summarization/retrieval/behavior analysis
  - informal, overlapped speech
• Data collection (ICSI, UW, ...):
  - 100 hours collected, ongoing transcription
  - headsets + tabletop + 'PDA'
Crosstalk cancellation
• Baseline speaker activity detection is hard:
  [Figure: energy traces (dB) for speakers A-E plus the tabletop mic over ~35 s, annotated with 'speaker active', 'speaker B cedes floor', interruptions, breath noise, and crosstalk]
• Noisy crosstalk model: m = C · s + n
• Estimate the subband coupling C_Aa from A's peak energy
  - ... including pure delay (10 ms frames)
  - ... then linear inversion
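A sketch of the per-subband inversion step, assuming the complex coupling matrix has already been estimated (the estimation from each speaker's peak-energy frames, and the noise term, are omitted); array shapes are illustrative:

    import numpy as np

    def remove_crosstalk(mic_stft, coupling):
        """Least-squares inversion of the crosstalk model m = C·s + n,
        applied independently in each subband.
        mic_stft: (M, K, T) complex STFT of M microphone channels
        coupling: (K, M, M) estimated complex coupling matrix per subband
        """
        sources = np.empty_like(mic_stft)
        for k in range(mic_stft.shape[1]):
            sources[:, k, :] = np.linalg.pinv(coupling[k]) @ mic_stft[:, k, :]
        return sources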
Participant motion detection
• Cross-correlation gives speaker-mic coupling:
  [Figure: meeting mr-2000-11-02-1440, speaker C. Per-channel cross-correlation (delay in samples vs. time, ~1020-1024.5 s) showing the coupling C → A and C → tabletop changing as the participant moves]
• Changes in the coupling impulse response show changes in path/orientation
• Comparison between different channels
  → distinguish speaker and listener motion
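A broadband sketch of the coupling measurement (the system above works per frequency channel); tracking how the peak lag and correlation shape change over time indicates movement:

    import numpy as np

    def coupling_lag(close_mic, distant_mic, max_lag=20):
        """Normalized cross-correlation between a speaker's close mic and a
        distant mic over lags -max_lag..+max_lag samples (equal-length
        signals assumed); returns lags, correlation values, and peak lag."""
        x = close_mic - close_mic.mean()
        y = distant_mic - distant_mic.mean()
        lags = np.arange(-max_lag, max_lag + 1)
        xc = np.array([np.sum(x[max(0, -l):len(x) - max(0, l)]
                              * y[max(0, l):len(y) - max(0, -l)]) for l in lags])
        xc /= (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)
        return lags, xc, int(lags[np.argmax(np.abs(xc))])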
PDA-based speaker change detection
• Goal: small conference-tabletop device
• Speaker turns from PDA mock-up signals?
  [Figure: pda.aif excerpt (31-40 s) with 512-point cross-correlation, 80%-of-max threshold: spectrogram, transcribed turns from Morgan, Adam and Dan, and inter-channel lag (ms) traces]
• Speaker-change detection (SCD) algorithm on spectral + interaural features
  - average spectrum + per-channel ITD, ∆φ
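For the inter-channel time difference, one common estimator is a phase-transform (GCC-PHAT) cross-correlation; this is a generic sketch, not necessarily the 512-point correlation used above:

    import numpy as np

    def itd_phat(left, right, sr, max_lag_ms=1.5):
        """Inter-channel time difference via GCC-PHAT; returns the lag of
        the correlation peak in milliseconds."""
        n = len(left) + len(right)
        L = np.fft.rfft(left, n)
        R = np.fft.rfft(right, n)
        cross = L * np.conj(R)
        cross /= np.maximum(np.abs(cross), 1e-12)   # phase-transform weighting
        cc = np.fft.irfft(cross, n)
        max_lag = int(sr * max_lag_ms / 1000)
        cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))  # lags -max..+max
        return 1000.0 * (int(np.argmax(cc)) - max_lag) / sr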
Computational ASA
[Diagram: sound mixture → CASA → Object 1 description, Object 2 description, Object 3 description, ...]
• Goal: Automatic sound organization;
  systems to 'pick out' sounds in a mixture
  - ... like people do
• E.g. voice against a noisy background
  - to improve speech recognition
• Approach:
  - psychoacoustics describes grouping 'rules'
  - ... just implement them?
CASA front-end processing
• Correlogram: loosely based on known/possible physiology
  [Diagram: sound → cochlea filterbank → frequency channels → envelope follower → delay line → short-time autocorrelation → correlogram slice (frequency × lag, over time)]
  - linear filterbank cochlear approximation
  - static nonlinearity
  - zero-delay slice is like a spectrogram
  - periodicity from delay-and-multiply detectors
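A minimal correlogram sketch, using a Butterworth filterbank in place of a gammatone cochlear model; band spacing, frame and lag sizes are illustrative:

    import numpy as np
    from scipy.signal import butter, lfilter

    def correlogram(x, sr, n_bands=32, fmin=100.0, fmax=4000.0,
                    frame_ms=25.0, max_lag_ms=12.5):
        """Bandpass filterbank, half-wave rectification, then short-time
        autocorrelation per channel; returns (n_bands, n_frames, n_lags)."""
        centers = np.geomspace(fmin, fmax, n_bands)
        frame = int(sr * frame_ms / 1000)
        n_lags = int(sr * max_lag_ms / 1000)
        n_frames = (len(x) - frame - n_lags) // frame
        out = np.zeros((n_bands, n_frames, n_lags))
        for b, fc in enumerate(centers):
            lo = fc / 2 ** (1 / 6)                       # ~1/3-octave band
            hi = min(fc * 2 ** (1 / 6), 0.99 * sr / 2)
            ba = butter(2, [lo / (sr / 2), hi / (sr / 2)], btype='band')
            y = np.maximum(lfilter(*ba, x), 0.0)         # half-wave rectify
            for t in range(n_frames):
                seg = y[t * frame : t * frame + frame + n_lags]
                for lag in range(n_lags):
                    out[b, t, lag] = np.dot(seg[:frame], seg[lag:lag + frame])
        return out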
Problems with ‘bottom-up’ CASA
[Figure: schematic of time-frequency elements]
• Circumscribing time-frequency elements
  - need to have 'regions', but hard to find
• Periodicity is the primary cue
  - how to handle aperiodic energy?
• Resynthesis via masked filtering
  - cannot separate within a single t-f element
• Bottom-up leaves no ambiguity or context
  - how to model illusions?
Generic sound elements for PDCASA
• Goal is a representational space that
  - covers real-world perceptual sounds
  - has minimal parameterization (sparseness)
  - keeps separate attributes in separate parameters
  [Figure: the three generic element types and their time-frequency envelopes:
   NoiseObject — spectral profile H[k] and magnitude contour A[n]: X_N[n,k] = H[k]·A[n]
   ClickObject — spectral profile H[k] and decay times T[k]: X_C[n,k] = H[k]·exp(-(n - n0)/T[k])
   WeftObject — envelope H_W[n,k] and period track p[n] from correlogram analysis: X_W[n,k] = H_W[n,k]·P[n,k]]
• Object hierarchies built on top...
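A sketch of synthesizing the magnitude envelopes of the first two element types (the weft element would additionally need the period track from the correlogram analysis, omitted here); channel counts and time constants are illustrative:

    import numpy as np

    def noise_element(H, A):
        """NoiseObject: X_N[n, k] = H[k] · A[n]."""
        return np.outer(A, H)                      # (N frames, K channels)

    def click_element(H, T, n0, n_frames):
        """ClickObject: X_C[n, k] = H[k] · exp(-(n - n0) / T[k]) for n >= n0."""
        n = np.arange(n_frames)[:, None]
        env = np.exp(-(n - n0) / T[None, :])
        env[np.arange(n_frames) < n0] = 0.0        # silent before the onset
        return H[None, :] * env

    # a 10-channel, 50-frame noise element: flat profile, decaying contour
    X = noise_element(np.ones(10), np.exp(-np.arange(50) / 20.0))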
PDCASA for old-plus-new
• Incremental analysis
  [Figure: input signal analyzed at successive times t1, t2, t3.
   Time t1: initial element created.
   Time t2: additional element required.
   Time t3: second element finished.]