Document 10439027

advertisement
LabROSA
Research Overview
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
!
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. Music
2. Environmental sound
3. Speech Enhancement
Laboratory for the Recognition and
Organization of Speech and Audio
LabROSA Research Overview - Dan Ellis
COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
2014-06-12 - 1 /20
LabROSA
• Getting information from sound
Information
Extraction
Music
Recognition
Environment
Separation
Machine
Learning
Retrieval
Signal
Processing
Speech
LabROSA Research Overview - Dan Ellis
2014-06-12 - 2 /20
1. Music Audio Analysis
Poliner & Ellis ’06
• Trained classifiers for low-level information
• notes, chords, beats, section boundaries
• E.g. Polyphonic transcription
!
!
!
!
!
!
• feature agnostic
• needs training data
LabROSA Research Overview - Dan Ellis
2014-06-12 - 3 /20
Million Song Dataset
• Industrial-scale database for music information research
• Many facets:
Bertin-Mahieux
McFee
• Echo Nest audio features
+ metadata
• Echo Nest “taste profile”
user-song-listen count
• Second Hand Song covers
• musiXmatch lyric BoW
• last.fm tags
• Now with audio?
• resolving artist / album / track / duration against what.cd
LabROSA Research Overview - Dan Ellis
2014-06-12 - 4 /20
MIDI-to-MSD
Raffel
• Aligned MIDI to Audio is a nice transcription
!
!
!
!
!
!
!
!
!
• Can we find matches in large databases?
LabROSA Research Overview - Dan Ellis
2014-06-12 - 5 /20
Singing ASR
McVicar
• Speech recognition adapted to singing
• needs aligned data
• Extensive work to line up scraped
“acapellas” and full mix
• including jumps!
LabROSA Research Overview - Dan Ellis
2014-06-12 - 6 /20
4, one can hear some high frequency isolated coefficients superimposed to the separated voice. This drawback could be reduced by
including harmonicity priors in the sparse component of RPCA,
as
Papadopoulos
proposed in [20].
• Ground
versus estimated
voice
activity location. ImRPCAtruth
separates
vocals and
background
perfect voice
location
still allows an improvebasedactivity
on low
rankinformation
optimization
ment, although to a lesser extent than with ground-truth voice acparameter
• single trade-off
tivity information.
The decrease
in the results mainly comes from
basedclassified
on higher-level
features?
• adjust
background
segments
as vocalmusical
segments.
Block Structure RPCA
•
e.
Fig.
4. Separated
for various
LabROSA
Research
Overview voice
- Dan Ellis
values of λ for the Pink Noise Party
song - 7 /20
2014-06-12
Ordinal LDA Segmentation
McFee
• Low-rank decomposition of skewed self-similarity to identify repeats
• Learned weighting of multiple factors to segment
!
• Linear Discriminant Analysis between adjacent segments
LabROSA Research Overview - Dan Ellis
2014-06-12 - 8 /20
2. Environmental Sound
• Extracting useful information from
soundtracks
• e.g. TRECVID Multimedia Event Detection
(MED)
• “Making a Sandwich”, “Getting a Vehicle Unstuck”
• 100 examples, find matches in 100k videos
• manual annotations for ~10 h
E009 Getting a Vehicle Unstuck
LabROSA Research Overview - Dan Ellis
2014-06-12 - 9 /20
Foreground Event Recognition
Cotton, Ellis, Loui ’11
• Transients =
foreground events?
• Onset detector
finds energy bursts
• best SNR
• PCA basis to
represent each
• 300 ms x auditory
freq
• “bag of transients”
LabROSA Research Overview - Dan Ellis
2014-06-12 - 10/20
NMF Transient Features
Smaragdis & Brown ’03
Abdallah & Plumbley ’04
Virtanen ’07
Cotton & Ellis’ 11
• Decompose spectrograms into
!
!
!
!X
!
freq / Hz
templates + activation
Original mixture
5478
3474
2203
1398
883
=W•H
• well-behaved gradient descent
• 2D patches
• sparsity control
• computation time…
LabROSA Research Overview - Dan Ellis
442
Basis 1
(L2)
Basis 2
(L1)
Basis 3
(L1)
0
1
2
3
4
5
6
7
8
9
10
time / s
2014-06-12 - 11/20
Background Retrieval
McDermott et al. ’09
Ellis, Zheng, McDermott ’11
• Classify soundtracks by statistics of ambience
• E.g. Texture features
!
!
Sound
\x\
Automatic
gain
control
\x\
mel
filterbank
(18 chans)
\x\
\x\
\x\
mean, var,
skew, kurt
(18 x 4)
1159_10 urban cheer clap
1062_60 quiet dubbed speech music
2404
1273
1
617
0
2
4
6
8
10
0
2
4
6
8
10
time / s
Texture features
mel band
freq / Hz
Modulation
energy
(18 x 6)
Cross-band
correlations
(318 samples)
Envelope
correlation
!
• Subband
distributions
• Envelope
cross-corrs
Histogram
\x\
!
!
Octave bins
0.5,1,2,4,8,16 Hz
FFT
0
level
15
10
5
M V S K 0.5 2 8 32
moments
LabROSA Research Overview - Dan Ellis
mod frq / Hz
5
10 15
mel band
M V S K 0.5 2 8 32
moments
mod frq / Hz
5
10 15
mel band
2014-06-12 - 12/20
Auditory Model Features
• Subband Autocorrelation PCA
Lyon et al. 2010
Lee & Ellis 2012
Cotton & Ellis 2013
• Simplified version of autocorrelogram
• 10x faster than Lyon original
• Capture fine time structure in multiple bands
• information lost in MFCCs
de
lay
short-time
autocorrelation
e
lin
frequency channels
Sound Cochlea
filterbank
Subband VQ
Subband VQ
Histogram
Feature Vector
Subband VQ
Subband VQ
freq
lag
Correlogram
slice
time
LabROSA Research Overview - Dan Ellis
2014-06-12 - 13/20
Subband Autocorrelation
lin
short-time
autocorrelation
e
Subband VQ
Sound Cochlea
filterbank
Subband VQ
Subband VQ
Histogram
Feature Vector
Autocorrelation stabilizes
fine time structure
lay
frequency channels
•
de
Subband VQ
freq
lag
Correlogram
slice
time
• 25 ms window,
lags up to 25 ms
• calculated every
10 ms
• normalized to max
(zero lag)
LabROSA Research Overview - Dan Ellis
2014-06-12 - 14/20
Retrieval Examples
• High precision for in-domain top hits
LabROSA Research Overview - Dan Ellis
2014-06-12 - 15/20
3. Speech Enhancement
• Noisy speech scenarios
• Ambient recording (background noise)
• Communication channel (processing distortion)
dB
CAR KIT - BP 101 43086 20111025 160708 in
100
50
0
0
1000
2000
3000 freq / Hz 0
4000
1000
2000
3000 freq / Hz
100
2000
level / dB
freq / Hz
HOME LAND - BP 101 84943 20111020 144955 in
50
100
0
level / dB
1500 Hz chan
50
0
87
90
93
96
LabROSA Research Overview - Dan Ellis
99
time / s
67
70
73
76
79
time / s
2014-06-12 - 16/20
RPCA Enhancement
• Decompose spectrogram into sparse + low-rank • Sparse activation H of Chen, McFee & Ellis ’14
dictionary W
min
!
H,L,S
H kHk1
L kLk⇤
+
+
!
+ I+ (H)
!
s.t. Y != W H + L + S
S kSk1
• ASR benefits:
C
S
D
I
Orig
6.8
10.6
82.6
0.7
RPCA
10.8
36.5
52.7
0.5
wie+RPCA
10.4
40.1
49.5
2.1
LabROSA Research Overview - Dan Ellis
2014-06-12 - 17/20
Classification Pitch Tracker
Lee & Ellis ’12
• SAcC: MLP trained on noisy speech with ground-truth pitch track targets
!
!
!
!
!
• Large benefits for
in-domain noisy
speech
YIN
Wu
get_f0
10
SAcC
10
−10
LabROSA Research Overview - Dan Ellis
FDA − RBF and pink noise
PTE (%)
!
100
0
10
20
SNR (dB)
30
2014-06-12 - 18/20
Pitch-Normalized Enhancement
• Use noise-robust pitch tracker for enhancement?
Clean signal
1000
500
• Normalize
voice pitch
0
!
• Reimpose pitch
0.5
1
1.5
2
Noisy signal
1000
2.5
Frequency
0
3
pitch
smoothed pvx
500
!
• Fixed-pitch
enhancement
0
0
0.5
1
1.5
2
resampled to pitch = 100 Hz
1000
2.5
3
500
0
0
0.5
1
1.5
2
2.5
Filtered − pvsmooth
0
0.5
1
1.5
2
2.5
3
Resampled back to original pitch
1000
3
3.5
4
4.5
3.5
4
4.5
500
0
1000
500
0
LabROSA Research Overview - Dan Ellis
0
0.5
1
1.5
2
2.5
3
Time
2014-06-12 - 19/20
Summary
• Music
• transcription, segmentation, …
• alignment for ground truth
!
• Soundtracks
• foreground events, background ambience
!
• Noisy Speech
• classification pitch tracking
• spectrogram enhancement
LabROSA Research Overview - Dan Ellis
2014-06-12 - 20/20
Download