Robustness, Separation & Pitch Morgan, Me & Pitch or

advertisement
Robustness, Separation & Pitch
or
Morgan, Me & Pitch
Dan Ellis
Columbia / ICSI
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. Robustness and Separation
2. An Academic Journey
3. Future
COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Robustness, Separation, Pitch - Dan Ellis
2015-03-14 - 1 /16
1953: How To Separate Speech?
• The “Cocktail Party Problem” [Cherry ’53]
• Spatial information: ATC over a single speaker
• Pitch differences via gender differences
• Auditory Scene Analysis [Bregman ’90]
• Grouping cues
• Onset
• Harmonicity
• Common Fate
• “Schema”
Robustness, Separation, Pitch - Dan Ellis
2015-03-14 - 2 /16
The Usefulness of Pitch
•
Common pitch can link energy !"#$%&$'()#*"&+$,-"#$'(!$./-#0
Brungart et al.’01
from a single source
1232'(4-**2&2#52.
Normal mix
232'(6-**2&2#52(5$#(72'8(-#(.82257(.20&20$,-"#
1-.,2#(*"&(,72(9%-2,2&(,$'/2&:
Robustness, Separation, Pitch - Dan Ellis
“Pitchless”
2015-03-14 - 3 /16
1984: Perception-Inspired Separation
• Model the periodicity information
in the auditory nerve
Lyon 1984
Weintraub 1985
Mix
Female
Robustness, Separation, Pitch - Dan Ellis
2015-03-14 - 4 /16
1996: An Academic Journey
t h e en er gy visible in t h e fu ll sign a l qu it e closely; n ot e, h owever , t h e h oles in
t h e weft en velopes, pa r t icu la r ly a r ou n d 300 H z in t h e secon d weft ; a t t h ese
poin t s, t h e t ot a l sign a l en er gy is fu lly expla in ed by t h e ba ckgr ou n d n oise
elem en t , a n d, in t h e a bsen ce of st r on g eviden ce for per iodic en er gy fr om t h e
in dividu a l cor r elogr a m ch a n n els, t h e weft in t en sit y for t h ese ch a n n els h a s
been ba cked off t o zer o.
• Dan in 1996: The “weft”
f/Hz
"Bad dog"
4000
2000
1000
400
200
1000
400
200
100
50
0.0
f/Hz
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
Wefts1,2
4000
2000
1000
400
200
1000
400
200
100
50
f/Hz
Click1
Click2
Click3
4000
2000
1000
400
200
f/Hz
Noise1
4000
Robustness,
Separation, Pitch - Dan Ellis
2000
−40
2015-03-14 - 5 /16
1999: “Size Matters”
• All you need is a BDNN
Ellis & Morgan’99
• .. and the data (and patience) to train it
WER vs. frames/weight
WER for PLP12N nets vs. net size & training data
44
178 GCUP
42
44
356 GCUP
42
712 GCUP
40
WER%
WER%
40
38
1.42 TCUP
38
36
2.8 TCUP
36
34
32
9.25
500
18.5
Training set / hours
5.7 TCUP
34
1000
37
2000
11.4 TCUP
Hidden layer / units
74 4000
32
1
2
5
10
20
50
100
200
500
frames/weight
Robustness, Separation, Pitch - Dan Ellis
2015-03-14 - 6 /16
2001: Overlap Remains
• Meeting Recorder Project
Janin, Baron, Edwards, Ellis, Gelbart, Morgan, Peskin, Pfau, Shriberg, Stolcke, Wooters ’03
• natural speech interactions
• ~10% of speech frames have overlaps
backchannel
mr-2000-06-30-1600
floor seizure
Spkr A
speaker
active
speaker B
cedes floor
Spkr B
Spkr C
interruptions
Spkr D
breath
noise
Spkr E
crosstalk
40
20
Table
top
level/dB
120
125
130
135
140
145
Robustness, Separation, Pitch - Dan Ellis
150
155
time / secs
2015-03-14 - 7 /16
2003: EARS
• “Pushing the envelope (aside)”
Robustness, Separation, Pitch - Dan Ellis
2015-03-14 - 8 /16
2004: Pitch Based Separation
• Literal implementations of the process
described in Bregman 1990:
• compute “regularity” cues:
- common onset
- gradual change
- harmonic patterns
- common fate
Original v3n7
Brown 1992
Ellis 1996
Hu & Wang 2004
Robustness, Separation, Pitch - Dan Ellis
Hu & Wang 2004
2015-03-14 - 9 /16
2004: Model-Based Separation
• Data-driven separation
Roweis ’01
Kristjansson, Attias, Hershey ’04
• Learn codebooks for individual speakers
• Find best combination of sources
• Pitch gives the “grist”
Robustness, Separation, Pitch - Dan Ellis
2015-03-14 - 10/16
2006: Pitch for VAD
Lee & Ellis’06
• Pitch is the most robust perceptual cue to
speech
Robustness, Separation, Pitch - Dan Ellis
2015-03-14 - 11/16
2004-2011: The Epic
Speech and Audio
Signal Processing
Processing and Perception of Speech and Music
S
E
C
O
N
D
E
D
I
T
I
O
N
Ben Gold t Nelson Morgan t Dan Ellis
Robustness, Separation, Pitch - Dan Ellis
2015-03-14 - 12/16
2012: Project Babel
freq / kHz
• Noisy speech is a challenge:
IARPA_BABEL_OP1_204_73990_20130521_162632_inLine
4
20
3
0
2
-20
1
0
0
-40
5
10
15
20
• How to disentangle speech and
25 time / sec level / dB
interference?
• Energy peaks are speech (spectral subtraction)
• Energy troughs are noise (Wiener, log-mmse)
• Speech has a known form (Factorial HMM)
• Voiced speech is periodic (Pitch-based)
Robustness, Separation, Pitch - Dan Ellis
2015-03-14 - 13/16
Classification-based Pitch Tracker
• Subband Autocorrelation Classification
Lee & Ellis ’12
(SAcC) Pitch Tracker:
• Trained on noisy speech with true pitch targets
de
lay
l
short-time
autocorrelation
ine
Neural network classifier
ck
c1
BN
frequency channels
Sound
Cochlea
filterbank
uv
60.0
61.8
63.6
65.4
Pitch
Viterbi
smoother
B3
B2
B1
freq
lag
392.1
403.6
Correlogram
slice
time
• Subband autocorrelation features
• PCA to reduce dimensions
Robustness, Separation, Pitch - Dan Ellis
2015-03-14 - 14/16
Flat-Pitch Processing
• Time-varying filtering is tricky
• if pitch variation and filter impulse response are on
a similar time-scale noisy
Flatten the pitch
• use local pitch estimate
to resample
• process constant-pitch
• resampling is (near) invertible
Noisy signal
1000
SAcC
pitch
tracker
freq / Hz
• Solution: speech
500
0
pitch-flatten
time map
pr(vx)
pitch
time-varying
resampling
Resampled to pitch = 200 Hz
flat-pitch-domain
enhancement
inverse
time map
Filtered − comb
time-varying
resampling
Resampled back to original pitch and mixed with original
enhanced
speech
0
Robustness, Separation, Pitch - Dan Ellis
0.5
1
1.5
time / s
2015-03-14 - 15/16
Conclusions
• Pitch is a key feature for speech separation
• “marking” signal against other speech or noise
• Some ideas don’t go away
• .. though they can change shape
• Impact from collaboration
• you can’t do good work with someone
without the human connection
Robustness, Separation, Pitch - Dan Ellis
2015-03-14 - 16/16
Download