Audio & Music Research
at LabROSA
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. Eigenrhythms: Representing drum tracks
2. Frequency-Domain Linear Prediction
3. Segmenting meeting turns
4. Analyzing ‘personal audio’ recordings
Audio/Music @ LabROSA - Dan Ellis
2004-08-24
LabROSA Projects Overview
[Diagram labels: Information Extraction, Machine Learning, Signal Processing; Music, Environment, Speech; Eigenrhythms, Meeting turns, Personal audio, FDLP]
1. Eigenrhythms: Drum Pattern Space
with John Arroyo
• Pop songs built on repeating “drum loop”
bass drum, snare, hi-hat
small variations on a few basic patterns
• Eigen-analysis (PCA) to capture variations?
by analyzing lots of (MIDI) data
• Applications
music categorization
“beat box” synthesis
Aligning the Data
• Need to align patterns prior to PCA...
tempo (stretch): by inferring BPM & normalizing
downbeat (shift): correlate against ‘mean’ template
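The two alignment steps above could be sketched as follows. This is only an illustration: the function names, the 4-beat bar and the 12-ticks-per-beat grid are assumptions, not details taken from the slides.

```python
import numpy as np

def tempo_normalize(onset_times, bpm, beats=4, ticks_per_beat=12):
    """Map onset times (seconds) onto a fixed tick grid, removing tempo.

    The grid size (beats * ticks_per_beat) is a hypothetical choice.
    """
    n_ticks = beats * ticks_per_beat
    beat_pos = np.asarray(onset_times, dtype=float) * bpm / 60.0  # position in beats
    ticks = np.round(beat_pos * ticks_per_beat).astype(int) % n_ticks
    grid = np.zeros(n_ticks)
    np.add.at(grid, ticks, 1.0)  # accumulate hits that land on the same tick
    return grid

def align_downbeat(pattern, template):
    """Circularly shift `pattern` to maximise its correlation with `template`."""
    scores = [np.dot(np.roll(pattern, s), template) for s in range(len(pattern))]
    best = int(np.argmax(scores))
    return np.roll(pattern, best), best
```

In practice the template would be the running mean of the already-aligned patterns, so alignment and averaging can be iterated.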
Eigenrhythms
• Need 20+ Eigenvectors for good coverage of 100 training patterns (1200 dims)
• Top patterns:
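A minimal sketch of the eigen-analysis, assuming the aligned patterns are stacked as rows of a matrix (100 x 1200 in the case above). The function names are illustrative, and PCA is computed here through an SVD of the mean-removed data.

```python
import numpy as np

def eigenrhythms(patterns, n_components=20):
    """PCA of aligned drum patterns (one pattern per row)."""
    mean = patterns.mean(axis=0)
    centered = patterns - mean
    # Rows of vt are the principal axes, ordered by decreasing variance
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    coverage = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    return mean, basis, coverage

def project(pattern, mean, basis):
    """Coordinates of one pattern in eigenrhythm space."""
    return basis @ (pattern - mean)

def reconstruct(weights, mean, basis):
    """Map eigenrhythm coordinates back to a full pattern."""
    return mean + weights @ basis
```

The `coverage` value is the fraction of training variance captured, which is one way to judge how many eigenvectors are needed.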
Eigenrhythms for Classification
• Clusters in Eigenspace:
[Figure: All tracks projected onto 1st two eigenrhythms (Eigenrhythm 1 vs. Eigenrhythm 2); each point is labeled genre:track, with genre codes bl, co, di, hh, ho, nw, pp, pu, rb, rc]
• Genre classification? (10 way)
nearest neighbor in 4D eigenspace: 21% correct
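A nearest-neighbor classifier over low-dimensional eigenspace coordinates might look like the sketch below. The leave-one-out scoring is an assumption; the slides do not say how the 21% figure was measured.

```python
import numpy as np

def nearest_neighbor_genre(query, train_coords, train_genres):
    """Return the genre of the training track closest to `query` (Euclidean)."""
    dists = np.linalg.norm(train_coords - query, axis=1)
    return train_genres[int(np.argmin(dists))]

def leave_one_out_accuracy(coords, genres):
    """Classify each track from all the others (assumed evaluation protocol)."""
    coords = np.asarray(coords, dtype=float)
    genres = np.asarray(genres)
    hits = 0
    for i in range(len(coords)):
        keep = np.arange(len(coords)) != i
        pred = nearest_neighbor_genre(coords[i], coords[keep], genres[keep])
        hits += int(pred == genres[i])
    return hits / len(coords)
```

For reference, chance level on a balanced 10-way task is 10%.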
Eigenrhythm BeatBox
• Resynthesize rhythms from eigen-space
2. Frequency-Domain Lin. Pred.
with Marios Athineos
• Time domain: (Time-domain) Linear Prediction, TDLP
– The well-known spectral estimator
$y[n] = \sum_{i=1}^{p} a_i\, y[n-i] + e[n]$
• Frequency domain: apply LP to a ‘frequency domain’ signal (DCT), FDLP
– estimates temporal envelope
– dual: Frequency is time and vice versa
$Y[k] = \sum_{i=1}^{p} b_i\, Y[k-i] + E[k]$
Athineos & Ellis - Music processing with FDLP
2004-05-25
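A rough sketch of the FDLP idea: take the DCT of a segment, fit an all-pole model to the DCT coefficients exactly as TDLP would fit a waveform, and read the model's "spectrum" as a temporal envelope. The autocorrelation-method solver, the function names and the model order are assumptions for illustration.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def lpc_autocorr(seq, order):
    """Autocorrelation-method LP: returns predictor coefficients and error energy."""
    r = np.correlate(seq, seq, mode="full")[len(seq) - 1 : len(seq) + order]
    coeffs = solve_toeplitz(r[:order], r[1 : order + 1])
    err = r[0] - np.dot(coeffs, r[1 : order + 1])
    return coeffs, err

def fdlp_envelope(x, order=20):
    """Temporal envelope estimate: LP applied to the DCT of x."""
    Y = dct(x, type=2, norm="ortho")          # the 'frequency domain' signal
    b, err = lpc_autocorr(Y, order)           # Y[k] ~ sum_i b_i Y[k-i]
    # Evaluate the all-pole model at len(x) points on [0, pi);
    # by duality this axis corresponds to time within the segment.
    _, h = freqz([1.0], np.concatenate(([1.0], -b)), worN=len(x))
    return err * np.abs(h) ** 2
```

For a noise burst centred partway through the segment, the returned envelope should peak near the burst position.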
Aside: Spectrogram of the DCT
• DCT gives a pure-real signal:
• Looks like a mirror image over time = freq axis
Can we treat it like a waveform?
FDLP and TDLP Duality
[Figure: FDLP and TDLP duality]
Subband FDLP
• Time-frequency temporal envelopes without slicing into 25 ms windows
[Diagram labels: Auditory STFT (10-25 ms + Bark bin); TDLP (per time frame); Subband FDLP (per frequency subband)]
FDLP Applications
• Time-scale modification
• Temporal equalization
• Perceptual audio features...
Cascade Time-Frequency LP
[Diagram labels: Analysis; Residual; DCT in freq.; Synthesis; OLA & iDCT; Overlap; 1 sec up to whole sample]
Modulation-domain Filtering (temporal equalization)
• Filtering in frequency = “temporal equalization”
• Filtering by inverse FDLP = Flat Temporal Envelopes
PLP-squared
Marios Athineos
Hynek Hermansky
• FDLP fits temporal envelope with LP
Perceptual Linear Prediction (PLP) smooths across frequency
can we do both... iteratively?
• Speech features without ST windows
[Figure: Bark band (5 to 15) vs. t / sec (0.02 to 0.22)]
3. Meeting Turns
with Jerry Liu and ICSI
• Multi-mic recordings for speaker turns
every voice reaches every mic... (?)
... but with differing coupling filters (delays, gains)
• Find turns with minimal assumptions
e.g. ad-hoc sensor setups (multiple PDAs)
differences to remove effect of source signal
- no spectral models, < 1xRT
Between-channel cues:
Timing (ITD) & Level
[Figure, time 125 to 175 s, four stacked panels under the speaker-activity ground truth: Timing diffs (ITD), xcorr peak lags in samples (2 mic pairs, 250 ms win, 5-pt median filter); Peak correlation coefficient r (normalized xcorr peak value); Per-channel energy / dB; Between-channel energy differences / dB]
Pre-whitening for ITD
• Inverse-filter by 12-pole LPC models (32 ms windows) to remove local resonances
• Filter out noise < 500 Hz, > 6 kHz
• Then cross-correlate...
[Figure: Short-time xcorr (lag / samps vs. time / sec, 1220 to 1240 s) for raw signals and for whitened+filtered signals, each shown above the speaker ground truth (spkr ID)]
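The pre-whitening chain can be sketched as below. The 16 kHz sampling rate, the non-overlapping frames and the 4th-order Butterworth band-pass are assumptions made for the example; only the 12-pole order, the 32 ms window and the 500 Hz to 6 kHz band come from the slide.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import butter, lfilter, sosfiltfilt

def lpc_inverse_filter(frame, order=12):
    """Whiten one frame with its own LPC inverse filter A(z)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1 : len(frame) + order]
    r[0] = r[0] * (1.0 + 1e-9) + 1e-12        # tiny regularisation for silent frames
    a = solve_toeplitz(r[:order], r[1 : order + 1])
    return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)

def whiten_and_bandpass(x, sr=16000, win_s=0.032, order=12, band=(500.0, 6000.0)):
    """Frame-wise LPC whitening followed by a band-pass filter (sr is assumed)."""
    n = int(win_s * sr)
    out = np.array(x, dtype=float)
    for start in range(0, len(x), n):
        frame = out[start : start + n]
        if len(frame) > order:                 # skip a too-short final frame
            out[start : start + n] = lpc_inverse_filter(frame, order)
    sos = butter(4, band, btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, out)
```

Both channels of a mic pair would be passed through `whiten_and_bandpass` before the short-time cross-correlation, so that the correlation peak reflects the delay rather than the resonances of the source.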
Choosing “Good” Frames
• Correlation coef. r ~ channel similarity:
$r_{ij}[\ell] = \dfrac{\sum_n m_i[n] \cdot m_j[n+\ell]}{\sqrt{\sum m_i^2 \sum m_j^2}}$
• Select frames with r in top 50% in both pairs
about 35% of points
[Figure: Skew34 / samples vs. Skew12 / samples, two panels: ITD - all points; ITD - high-correlation points (435/1201)]
• Cleaner basis for models
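A small sketch of the normalized correlation and the frame selection, assuming equal-length frames from the two mics. Using the median as the "top 50%" threshold is an interpretation of the slide, and the function names are illustrative.

```python
import numpy as np

def peak_norm_xcorr(mi, mj, max_lag):
    """Peak of r_ij[l] = sum_n mi[n] mj[n+l] / sqrt(sum mi^2 * sum mj^2)."""
    n = len(mi)
    denom = np.sqrt(np.sum(mi ** 2) * np.sum(mj ** 2))
    best_r, best_lag = -np.inf, 0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            num = np.dot(mi[: n - lag], mj[lag:])
        else:
            num = np.dot(mi[-lag:], mj[: n + lag])
        r = num / denom
        if r > best_r:
            best_r, best_lag = r, lag
    return best_r, best_lag

def select_good_frames(r12, r34):
    """Keep frames whose peak r is in the top half for BOTH mic pairs."""
    r12, r34 = np.asarray(r12), np.asarray(r34)
    return (r12 >= np.median(r12)) & (r34 >= np.median(r34))
```

Because a frame must pass in both pairs, the selected fraction is at most 50% and is typically lower.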
Spectral clustering
• Eigenvectors of “affinity matrix” A to pick out similar points:
$a_{mn} = \exp\{-\|x[m] - x[n]\|^2 / 2\sigma^2\}$
[Figure: Affinity matrix A (point index vs. point index); first 12 eigenvectors (normalized) vs. point index]
• Ad-hoc mapping to clusters
Number of clusters K from eigenvalues ≈ points
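One way to sketch this kind of spectral clustering is shown below. The degree normalization, the eigenvalue threshold used to pick K, and the greedy cosine-based mapping to clusters are all assumptions chosen for the illustration; the slide only says the mapping is ad hoc.

```python
import numpy as np

def spectral_cluster(x, sigma=1.0, eig_thresh=0.9):
    """Cluster points from the top eigenvectors of a Gaussian affinity matrix."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    A = np.exp(-d2 / (2.0 * sigma ** 2))                 # a_mn
    dinv = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * dinv[:, None] * dinv[None, :]                # degree-normalized affinity
    w, v = np.linalg.eigh(L)                             # ascending eigenvalues
    K = max(1, int(np.sum(w > eig_thresh)))              # heuristic choice of K
    emb = v[:, -K:]
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    labels = -np.ones(len(x), dtype=int)
    next_label = 0
    for m in range(len(x)):                              # greedy, ad hoc mapping
        if labels[m] < 0:
            similar = (emb @ emb[m] > 0.5) & (labels < 0)
            labels[similar] = next_label
            next_label += 1
    return labels, K
```

Both `sigma` and the threshold strongly affect the result, which is why the choice of these values matters for the clusters that come out.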
Speaker Models & Classification
• Actual clusters depend on σ and K heuristic
• Fit Gaussians to each cluster, assign that class to all frames within radius
or: consider dimensions independently, choose best
[Figure: three panels: ICSI0: good points; All pts: nearest class; All pts: closest dimension]
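The "fit Gaussians, then assign within a radius" step might be sketched like this. Measuring the radius as a Mahalanobis distance, the default radius of 3 and the label -1 for unassigned frames are assumptions for the example.

```python
import numpy as np

def fit_gaussians(points, labels):
    """Fit one full-covariance Gaussian per cluster label."""
    models = {}
    dim = points.shape[1]
    for k in np.unique(labels):
        pts = points[labels == k]
        mean = pts.mean(axis=0)
        cov = np.cov(pts, rowvar=False).reshape(dim, dim) + 1e-6 * np.eye(dim)
        models[int(k)] = (mean, np.linalg.inv(cov))
    return models

def classify_frames(points, models, radius=3.0):
    """Give each frame the nearest class within `radius` (Mahalanobis), else -1."""
    out = -np.ones(len(points), dtype=int)
    best = np.full(len(points), np.inf)
    for k, (mean, icov) in models.items():
        diff = points - mean
        dist = np.sqrt(np.einsum("nd,de,ne->n", diff, icov, diff))
        take = (dist < radius) & (dist < best)
        out[take] = k
        best[take] = dist[take]
    return out
```

The models are fitted on the selected high-correlation frames only, then applied to all frames.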
Performance Analysis
• Compare reference & system activity maps:
system misses quiet speakers 2,3,4 (deletions)
system splits speaker 6 (deletions+insertions)
many short gaps (deletions)
• ~52% avg. error on NIST 2004 dev set
speaker-characteristic-based systems ~25%
4. Segmenting Personal Audio
with Keansub Lee
• Easy to record everything you hear
~100GB / year @ 64 kbps
• Very hard to find anything
how to scan?
how to visualize?
how to index?
• Starting point: Collect data
~ 60 hours (8 days, ~7.5 hr/day)
hand-mark 139 segments (26 min/seg avg.)
assign to 16 classes (8 have multiple instances)
Features for Long Recordings
• Feature frames = 1 min (not 25 ms!)
• Characterize variation within each frame...
[Figure: six panels, freq / bark vs. time / min (50 to 450): Average Linear Energy (dB); Normalized Energy Deviation (dB); Average Log Energy (dB); Log Energy Deviation (dB); Average Spectral Entropy (bits); Spectral Entropy Deviation (bits)]
• ... and structure within coarse auditory bands
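A sketch of how such one-minute features could be computed from a short-time power spectrogram. The band edges given as bin indices (a stand-in for a real Bark mapping), the dictionary keys and the normalization of the deviations by the mean are assumptions for illustration.

```python
import numpy as np

def long_frame_features(power, band_edges, frames_per_min):
    """Per-band, per-minute statistics of energy and spectral entropy.

    power: (n_bins, n_short_frames) short-time power spectrogram.
    band_edges: bin indices delimiting coarse bands (hypothetical layout).
    """
    n_bands = len(band_edges) - 1
    n_min = power.shape[1] // frames_per_min
    eps = 1e-12
    energy = np.empty((n_bands, power.shape[1]))
    entropy = np.empty_like(energy)
    for b in range(n_bands):
        sub = power[band_edges[b] : band_edges[b + 1]] + eps
        energy[b] = sub.sum(axis=0)
        p = sub / energy[b]                        # spectrum within band as a pmf
        entropy[b] = -(p * np.log2(p)).sum(axis=0)

    def per_minute(v):
        v = v[:, : n_min * frames_per_min].reshape(n_bands, n_min, frames_per_min)
        return v.mean(axis=2), v.std(axis=2)

    mu_lin, sd_lin = per_minute(energy)
    mu_db, sd_db = per_minute(10.0 * np.log10(energy))
    mu_h, sd_h = per_minute(entropy)
    return {
        "avg_linear_energy": mu_lin,
        "norm_energy_dev": sd_lin / mu_lin,
        "avg_log_energy": mu_db,
        "log_energy_dev": sd_db,
        "avg_entropy": mu_h,
        "entropy_dev": sd_h / (mu_h + eps),
    }
```

Each returned array has one row per band and one column per minute, matching the layout of the panels described above.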
BIC Segmentation
• Untrained segmentation technique
statistical test indicates good change points:
$\log \dfrac{L(X_1; M_1)\, L(X_2; M_2)}{L(X; M_0)} \gtrless \dfrac{\lambda}{2} \log(N)\, \Delta\#(M)$
• Evaluate: 60hr hand-marked boundaries
different features & combinations
Correct Accept % @ False Accept = 2%:
µdB: 80.8%
µH: 81.1%
σH/µH: 81.6%
µdB + σH/µH: 84.0%
µdB + σH/µH + µH: 83.6%
[Figure: ROC curves, Sensitivity vs. 1 - Specificity (0 to 0.04), for the same five feature sets]
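The BIC test at a single candidate boundary can be sketched as follows, assuming each model M is one full-covariance Gaussian fitted by maximum likelihood; that model choice and the function names are assumptions for the example.

```python
import numpy as np

def gauss_loglik(x):
    """Log-likelihood of x (N x d) under its own ML full-covariance Gaussian."""
    n, d = x.shape
    cov = np.cov(x, rowvar=False, bias=True).reshape(d, d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (d * np.log(2.0 * np.pi) + logdet + d)

def bic_margin(x, t, lam=1.0):
    """Log-likelihood ratio minus BIC penalty for a boundary at frame t.

    A positive value favours two models (a change point at t).
    """
    n, d = x.shape
    log_ratio = gauss_loglik(x[:t]) + gauss_loglik(x[t:]) - gauss_loglik(x)
    extra_params = d + d * (d + 1) / 2.0       # one extra mean and covariance
    return log_ratio - 0.5 * lam * np.log(n) * extra_params
```

Scanning `t` over a window and taking the largest positive margin gives candidate boundaries; `lam` trades missed boundaries against false alarms.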
Segment clustering
• Daily activity has lots of repetition:
Automatically cluster similar segments
[Figure: matrix over segment classes (scale 0 to 1): supermkt, meeting, karaoke, barber, lecture2, billiard, break, lecture1, car/taxi, home, bowling, street, restaurant, library, campus]
• Spectral clustering achieves ~70% correct
16-way ground truth labels
KL distance, smoothed covariance estimates
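A possible form of the segment distance is a symmetrized KL divergence between one Gaussian per segment. The specific smoothing used here (shrinking the covariance toward its diagonal) is an assumption; the slide does not say how the estimates were smoothed.

```python
import numpy as np

def smoothed_cov(x, shrink=0.1):
    """Sample covariance shrunk toward its diagonal (hypothetical smoothing)."""
    d = x.shape[1]
    cov = np.cov(x, rowvar=False).reshape(d, d)
    return (1.0 - shrink) * cov + shrink * np.diag(np.diag(cov))

def kl_gauss(m0, c0, m1, c1):
    """KL( N(m0, c0) || N(m1, c1) ) for full-covariance Gaussians."""
    d = len(m0)
    ic1 = np.linalg.inv(c1)
    diff = m1 - m0
    _, ld0 = np.linalg.slogdet(c0)
    _, ld1 = np.linalg.slogdet(c1)
    return 0.5 * (np.trace(ic1 @ c0) + diff @ ic1 @ diff - d + ld1 - ld0)

def symmetric_kl(xa, xb, shrink=0.1):
    """Symmetrized KL distance between two segments' feature frames."""
    ma, mb = xa.mean(axis=0), xb.mean(axis=0)
    ca, cb = smoothed_cov(xa, shrink), smoothed_cov(xb, shrink)
    return kl_gauss(ma, ca, mb, cb) + kl_gauss(mb, cb, ma, ca)
```

A pairwise matrix of these distances can be turned into affinities, for example with a Gaussian kernel, and passed to spectral clustering.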
Future Work
• Visualization / browsing / diary inference
link to other information sources
• Privacy protection
speaker/speech “search and destroy”
LabROSA Summary
• LabROSA
signal processing
+ machine learning
+ information extraction
• Applications
Eigenrhythms: drum pattern models
FDLP temporal envelopes
Meeting recordings
Personal audio analysis
• Also...
music similarity, signal separation, ...