Audio & Music Research at LabROSA
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. Eigenrhythms: representing drum tracks
2. Frequency-Domain Linear Prediction
3. Anchor-Space Music Similarity Browsing
4. Transformation-based generative models
5. Analyzing ‘personal audio’ recordings
Audio/Music @ LabROSA - Dan Ellis
2004-07-29
LabROSA Projects Overview
[Figure: map of LabROSA projects. Areas: Information Extraction, Machine Learning, Signal Processing. Domains: Music, Environment, Speech. Projects: Eigenrhythms, Anchor space, Personal audio, Transform model, FDLP.]
1. Eigenrhythms: Drum Pattern Space
with John Arroyo
• Pop songs built on repeating “drum loop”
bass drum, snare, hi-hat
small variations on a few basic patterns
• Eigen-analysis (PCA) to capture variations?
by analyzing lots of (MIDI) data
• Applications
music categorization
“beat box” synthesis
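The eigen-analysis idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy data (100 patterns of 3 instruments x 16 steps, mixed from 3 basic patterns) and the function names `eigenrhythms`, `project`, and `resynthesize` are hypothetical and not taken from the project code.

```python
import numpy as np

def eigenrhythms(patterns, n_components):
    """PCA via SVD. `patterns` has shape (n_patterns, n_dims)."""
    mean = patterns.mean(axis=0)
    _, s, vt = np.linalg.svd(patterns - mean, full_matrices=False)
    basis = vt[:n_components]  # rows are orthonormal principal directions
    explained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    return mean, basis, explained

def project(pattern, mean, basis):
    """Coordinates of one pattern in the eigenspace."""
    return basis @ (pattern - mean)

def resynthesize(weights, mean, basis):
    """Rebuild a pattern from eigenspace coordinates ("beat box" direction)."""
    return mean + weights @ basis

# Hypothetical toy data: each pattern is a random mix of 3 basic patterns
# plus a small variation (48 dims = 3 instruments x 16 steps).
rng = np.random.default_rng(0)
basic = rng.random((3, 48))
patterns = rng.random((100, 3)) @ basic + 0.01 * rng.standard_normal((100, 48))

mean, basis, explained = eigenrhythms(patterns, n_components=3)
weights = project(patterns[0], mean, basis)
approx = resynthesize(weights, mean, basis)
print(f"variance explained by 3 components: {explained:.3f}")
```

Because the toy patterns really are mixtures of three basic patterns, three components capture nearly all of the variance; real drum data would be expected to need more.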
Aligning the Data
• Need to align patterns prior to PCA...
tempo (stretch): by inferring BPM & normalizing
downbeat (shift): correlate against ‘mean’ template
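The two alignment steps might look roughly like the sketch below. It assumes patterns are stored as onset grids with 16 steps per bar, and it uses a fixed template where the slide uses a ‘mean’ template; `normalize_tempo` and `align_downbeat` are hypothetical names.

```python
import numpy as np

def normalize_tempo(onset_times, bpm, steps_per_beat=4, n_steps=16):
    """Map onset times (seconds) onto a tempo-independent grid of steps."""
    step_dur = 60.0 / bpm / steps_per_beat
    idx = np.round(np.asarray(onset_times) / step_dur).astype(int) % n_steps
    grid = np.zeros(n_steps)
    grid[idx] = 1.0
    return grid

def align_downbeat(pattern, template):
    """Circularly shift `pattern` to maximize its correlation with `template`."""
    scores = [np.dot(np.roll(pattern, s), template) for s in range(len(pattern))]
    best = int(np.argmax(scores))
    return np.roll(pattern, best), best

# Onsets on every beat at 120 BPM land on steps 0, 4, 8, 12.
grid = normalize_tempo([0.0, 0.5, 1.0, 1.5], bpm=120)

# A pattern that starts 3 steps late is shifted back onto the template.
template = np.array([1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0], dtype=float)
late = np.roll(template, 3)
aligned, shift = align_downbeat(late, template)
```

In practice the template would presumably be re-estimated as the mean of the aligned patterns, alternating with the shift step.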
Eigenrhythms
• Need 20+ Eigenvectors for good coverage of 100 training patterns (1200 dims)
• Top patterns:
Eigenrhythms for Classification
• Clusters in Eigenspace:
[Figure: scatter plot of all tracks projected onto the 1st two eigenrhythms; axes Eigenrhythm 1 and Eigenrhythm 2, each running roughly from -6 to 6; points labeled genre:track with genre prefixes bl, co, di, hh, ho, nw, pp, pu, rb, rc (e.g. hh:gThang, rb:honey, rc:jump, di:funkytwn)]
• Genre classification? (10 way)
nearest neighbor in 4D eigenspace: 21% correct
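A nearest-neighbor evaluation of this kind can be sketched as follows. The slide does not say how the test was run, so the leave-one-out protocol, the Euclidean distance, and the function name are assumptions, and the four toy points are invented for illustration.

```python
import numpy as np

def loo_nearest_neighbor_accuracy(coords, labels):
    """Leave-one-out 1-NN accuracy: each track takes the genre of its
    nearest other track in eigenspace."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a track may not be its own neighbor
    nearest = dist.argmin(axis=1)
    return float(np.mean(labels[nearest] == labels))

# Hypothetical 4D eigenspace coordinates for four tracks in two genres.
coords = [[0, 0, 0, 0], [0.1, 0, 0, 0], [5, 5, 5, 5], [5.1, 5, 5, 5]]
labels = ["hh", "hh", "rc", "rc"]
accuracy = loo_nearest_neighbor_accuracy(coords, labels)
```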
Eigenrhythm BeatBox
• Resynthesize rhythms from eigen space
2. Frequency-Domain Lin. Pred.
with Marios Athineos
• Time domain: (Time-domain) Linear Prediction, TDLP
– The well-known spectral estimator
$y[n] = \sum_{i=1}^{p} a_i\, y[n-i] + e[n]$
• Frequency domain: apply to a ‘frequency domain’ signal (DCT), FDLP
– estimates temporal envelope
– dual: Frequency is time and vice versa
$Y[k] = \sum_{i=1}^{p} b_i\, Y[k-i] + E[k]$
(slide from Athineos & Ellis - Music processing with FDLP, 2004-05-25)
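The duality can be tried out numerically: take the DCT of a signal, fit a linear predictor to the DCT coefficients, and read the resulting all-pole "spectrum" as a function of time. The sketch below is a minimal illustration and not the authors' implementation; it assumes the autocorrelation method for the predictor and a direct O(N^2) DCT-II written out for clarity, and all function names are hypothetical.

```python
import numpy as np

def dct_ii(x):
    """Direct (unnormalized) DCT-II; fine for short signals."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x

def lpc(signal, order):
    """Autocorrelation-method linear prediction.
    Returns predictor coefficients (a_1..a_p) and residual energy."""
    n = len(signal)
    r = np.correlate(signal, signal, mode="full")[n - 1:n + order]  # lags 0..p
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    gain = r[0] - a @ r[1:order + 1]
    return a, gain

def fdlp_envelope(x, order, n_points):
    """All-pole fit to the DCT of x, evaluated on [0, pi); by duality the
    angle axis corresponds to time within the analysis window."""
    b, gain = lpc(dct_ii(x), order)
    theta = np.pi * np.arange(n_points) / n_points
    i = np.arange(1, order + 1)
    denom = np.abs(1.0 - np.exp(-1j * np.outer(theta, i)) @ b) ** 2
    return gain / denom

# A single click at sample 60 of a 200-sample window.
x = np.zeros(200)
x[60] = 1.0
envelope = fdlp_envelope(x, order=2, n_points=200)
peak = int(np.argmax(envelope))  # expected to fall close to sample 60
```

The click's DCT is a cosine in the frequency index, so the predictor places a pole pair at the matching angle and the envelope peaks near the click time, just as TDLP places a spectral peak at the frequency of a time-domain sinusoid.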
Aside: Spectrogram of the DCT
• DCT gives a pure-real signal:
• Looks like a mirror image over time = freq axis
Can we treat it like a waveform?
FDLP and TDLP Duality
Subband FDLP
• Temporal envelopes without slicing into 25 ms windows
[Figure: time-frequency panels: Auditory STFT (10-25ms + Bark bin); TDLP (per time frame); Subband FDLP (per frequency subband)]
FDLP Applications
• Time-scale modification
• Modulation-domain filtering
• Temporal equalization: filtering in frequency
• Perceptual audio features... “PLP-squared”
Cascade Time-Frequency LP
[Figure: analysis/synthesis block diagram. Labels: Analysis, DCT in freq., Residual, Synthesis, Overlap, OLA & iDCT, 1 sec up to whole sample]
• Flat temporal envelopes (temporal equalization) = filtering by inverse FDLP
3. Music Similarity Browsing
with Adam Berenzweig
• Musical information overload
record companies filter/categorize music
an automatic system would be less odious
• Connecting audio and preference
map to a ‘semantic space’?
[Figure: anchor-space pipeline. Audio Input (Class i, Class j) → Anchor models → Conversion to Anchorspace, giving an n-dimensional vector in "Anchor Space" [p(a1|x), p(a2|x), ..., p(an|x)] → GMM Modeling → Similarity Computation (KL-d, EMD, etc.)]
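The pipeline in the figure can be imitated on toy data. The sketch below is a simplification under loud assumptions: the anchor models are equal-prior spherical Gaussians in a 2-D feature space, and each artist is modeled by a single diagonal Gaussian (instead of a GMM) so that the KL divergence has a closed form. None of the names or numbers come from the actual system.

```python
import numpy as np

def anchor_posteriors(frames, anchor_means, var=4.0):
    """p(a_k | x) per frame under toy equal-prior spherical Gaussian anchors."""
    d2 = ((frames[:, None, :] - anchor_means[None, :, :]) ** 2).sum(axis=-1)
    log_lik = -0.5 * d2 / var
    log_lik -= log_lik.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_lik)
    return post / post.sum(axis=1, keepdims=True)

def fit_diag_gaussian(points, floor=1e-6):
    """Single diagonal Gaussian: a stand-in for the GMM stage."""
    return points.mean(axis=0), points.var(axis=0) + floor

def symmetric_kl(g1, g2):
    """KL(g1||g2) + KL(g2||g1) for diagonal Gaussians given as (mean, var)."""
    (m1, v1), (m2, v2) = g1, g2
    kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m1 - m2) ** 2) / v1 - 1.0)
    return kl12 + kl21

rng = np.random.default_rng(0)
anchors = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 3.0]])  # 3 hypothetical anchors
artist_a = rng.normal([0.0, 0.0], 0.5, size=(200, 2))
artist_b = rng.normal([0.3, 0.0], 0.5, size=(200, 2))   # close to artist_a
artist_c = rng.normal([3.0, 3.0], 0.5, size=(200, 2))   # far from artist_a

models = [fit_diag_gaussian(anchor_posteriors(f, anchors))
          for f in (artist_a, artist_b, artist_c)]
d_ab = symmetric_kl(models[0], models[1])
d_ac = symmetric_kl(models[0], models[2])
```

Artists whose frames excite the anchors in similar proportions end up close in anchor space, so `d_ab` comes out smaller than `d_ac` here.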
Anchor Space
• Frame-by-frame high-level categorizations
compare to raw features?
[Figure: two scatter plots of frames from madonna and bowie. Cepstral Features: third cepstral coef vs. fifth cepstral coef. Anchor Space Features: Country vs. Electronica.]
properties in distributions? dynamics?
‘Playola’ Similarity Browser
Ground-truth data
• Hard to evaluate Playola’s ‘accuracy’
user tests...
ground truth?
• “Musicseer” online survey:
ran for 9 months in 2002
> 1,000 users, > 20k judgments
http://labrosa.ee.columbia.edu/projects/musicsim/
Evaluation
• Compare Anchor Space measures against Musicseer subjective results
“triplet” agreement percentage
Top-N ranking agreement score:
$s_i = \sum_{r=1}^{N} \alpha_r^{\,r}\, \alpha_c^{\,k_r}$, with $\alpha_r = \left(\tfrac{1}{2}\right)^{1/3}$ and $\alpha_c = \alpha_r^2$
First-place agreement percentage - simple significance test
[Figure: bar chart “Top rank agreement” (%, 0 to 80) for measures cei, cmb, erd, e3d, opn, kn2, rnd, ANK; series SrvKnw 4789x3.58, SrvAll 6178x8.93, GamKnw 7410x3.96, GamAll 7421x8.92]
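The Top-N ranking agreement score can be computed directly from two ranked lists. The slide does not define k_r; the sketch below assumes it is the rank, under the candidate measure, of the item ranked r-th by the reference, and that items missing from the candidate list contribute nothing. Both choices are assumptions, as is the function name.

```python
def top_n_agreement(reference_ranking, candidate_ranking, n=10):
    """s = sum over r = 1..N of alpha_r**r * alpha_c**k_r."""
    alpha_r = 0.5 ** (1.0 / 3.0)
    alpha_c = alpha_r ** 2
    candidate_rank = {item: k for k, item in enumerate(candidate_ranking, start=1)}
    score = 0.0
    for r, item in enumerate(reference_ranking[:n], start=1):
        k_r = candidate_rank.get(item)
        if k_r is not None:  # assumption: unranked items add nothing
            score += alpha_r ** r * alpha_c ** k_r
    return score

same = top_n_agreement(["a", "b", "c"], ["a", "b", "c"], n=3)
flipped = top_n_agreement(["a", "b", "c"], ["c", "b", "a"], n=3)
```

With these constants an item ranked r-th by both lists contributes (1/2)^r, so agreement at the top of the list dominates the score.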
4. Transformation-based models
with Manuel Reyes and Nebojsa Jojic
• HMMs are poor generative models
accurate modeling requires 1000s of states
• Observation:
Speech spectra undergo minor deformations
suggests a different generative model?
[Figure: a transformation matrix T maps an NP=5-bin patch of the previous frame X_{t-1} onto an NC=3-bin patch of the current frame X_t (frequency bins 1 to 9), e.g. T = [0 0 1 0 0; 0 0 0 1 0; 0 0 0 0 1]]
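A shift-style transformation matrix of this shape is easy to build and apply. In the sketch below, `shift_transform` and `best_shift` are hypothetical helpers, and the greedy least-squares choice of shift is only a stand-in for the probabilistic inference the model actually uses.

```python
import numpy as np

def shift_transform(n_out, n_in, shift):
    """Binary (n_out x n_in) matrix that copies a shifted window of the
    previous-frame patch; shift=0 copies the centered window."""
    offset = (n_in - n_out) // 2
    T = np.zeros((n_out, n_in))
    for row in range(n_out):
        col = row + offset + shift
        if 0 <= col < n_in:
            T[row, col] = 1.0
    return T

def best_shift(prev_patch, cur_patch, shifts=(-1, 0, 1)):
    """Pick the shift whose prediction of the current patch has least squared error."""
    errors = {
        s: float(np.sum((shift_transform(len(cur_patch), len(prev_patch), s)
                         @ prev_patch - cur_patch) ** 2))
        for s in shifts
    }
    return min(errors, key=errors.get)

T = shift_transform(n_out=3, n_in=5, shift=1)
prev_patch = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # NP=5 bins of X_{t-1}
cur_patch = T @ prev_patch                         # NC=3 bins of X_t
chosen = best_shift(prev_patch, cur_patch)
```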
States+Transformation Model
• Time-frequency state grid
• State → explicit prototype or a transformation on prior frame
• Infer underlying states
[Figure: a) Signal; b) Transformation Map, a time-frequency grid of states X and transformations T. Green: Identity transform. Yellow/Orange: Upward motion (darker is steeper). Blue: Downward motion (darker is steeper).]
Two-layer model
• Source-filter decomposition
pitch and formants have different dynamics
• Apply transformation models for both
log-spectra: sum of excitation & filter
inference does separation
[Figure: Signal = Harmonics + Formants; panels: Selected Bin, Harmonic Tracking, Formant Tracking]
Transformation model applications
• Compact, accurate source descriptions
only a few explicit states needed
[Figure: a) States; b) Reconstruction; Iter. 1; c) Reconstruction; Iter. 3; d) Reconstruction; Iter. 5; e) Reconstruction; Iter. 8]
• Belief propagation can infer missing values
.. of state grid, hence magnitude spectrum
[Figure: a) Original; b) Missing Data; c) After iteration 10; d) After iteration 30]
5. Segmenting Personal Audio
with Keansub Lee
• Easy to record everything you hear
~100GB / year @ 64 kbps
• Very hard to find anything
how to scan?
how to visualize?
how to index?
• Starting point: Collect data
~ 60 hours (8 days, ~7.5 hr/day)
hand-mark 139 segments (26 min/seg avg.)
assign to 41 classes (8 have multiple instances)
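The figures on this slide can be checked with a little arithmetic. The sketch assumes a constant 64 kbps bitrate and decimal gigabytes; the variable names are just for illustration.

```python
BITRATE_BPS = 64_000                        # 64 kbps
bytes_per_hour = BITRATE_BPS / 8 * 3600     # 28.8 MB per recorded hour

# Recording around the clock would take about 252 GB per year ...
gb_per_year_24h = bytes_per_hour * 24 * 365 / 1e9
# ... so ~100 GB/year corresponds to roughly 9.5 recorded hours per day.
hours_per_day_for_100gb = 100e9 / (bytes_per_hour * 365)

# Data set on the slide: 8 days at ~7.5 hr/day, hand-marked into 139 segments.
total_hours = 8 * 7.5
avg_segment_min = total_hours * 60 / 139
```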
Features for Long Recordings
• Feature frames = 1 min (not 25 ms!)
• Characterize variation within each frame...
[Figure: six panels over freq / bark vs. time / min (0 to 450 min): Average Linear Energy (dB), Normalized Energy Deviation (dB), Average Log Energy (dB), Log Energy Deviation (dB), Average Spectral Entropy (bits), Spectral Entropy Deviation (bits)]
• and structure within coarse auditory bands
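One way to compute long-frame statistics of this kind is sketched below. The slide does not give exact definitions, so the formulas here are assumptions: deviations are standard deviations within each long frame (normalized by the mean where the panel name says "normalized"), and spectral entropy is the entropy of the power distribution across the bins inside each band. `band_features` and its arguments are hypothetical.

```python
import numpy as np

def band_features(power_spec, band_edges, frames_per_block):
    """power_spec: (n_frames, n_bins) short-time power spectrum.
    Returns per-block, per-band statistics, each of shape (n_blocks, n_bands)."""
    eps = 1e-12
    energies, entropies = [], []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = power_spec[:, lo:hi]
        total = band.sum(axis=1)
        p = band / (total[:, None] + eps)          # distribution within the band
        energies.append(total)
        entropies.append(-(p * np.log2(p + eps)).sum(axis=1))
    energy = np.stack(energies, axis=1)
    entropy = np.stack(entropies, axis=1)

    def blocks(x):
        n = x.shape[0] // frames_per_block
        return x[: n * frames_per_block].reshape(n, frames_per_block, -1)

    e, h = blocks(energy), blocks(entropy)
    log_e = 10.0 * np.log10(e + eps)
    return {
        "avg_linear_energy_db": 10.0 * np.log10(e.mean(axis=1) + eps),
        "norm_energy_dev": e.std(axis=1) / (e.mean(axis=1) + eps),
        "avg_log_energy_db": log_e.mean(axis=1),
        "log_energy_dev_db": log_e.std(axis=1),
        "avg_entropy_bits": h.mean(axis=1),
        "entropy_dev": h.std(axis=1) / (h.mean(axis=1) + eps),
    }

# Flat toy spectrum: 120 short frames, 8 bins split into 2 bands, 60 frames per block.
features = band_features(np.ones((120, 8)), band_edges=[0, 4, 8], frames_per_block=60)
```

For the flat toy spectrum every band has an entropy of 2 bits (four equal bins) and zero deviation, which makes a convenient sanity check.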
BIC Segmentation
• Untrained segmentation technique
statistical test indicates good change points:
$\log \frac{L(X_1; M_1)\, L(X_2; M_2)}{L(X; M_0)} \gtrless \frac{\lambda}{2} \log(N)\, \Delta\#(M)$
• Evaluate: 60hr hand-marked boundaries
different features & combinations
Correct Accept % @ False Accept = 2%:
µdB: 80.8%
µH: 81.1%
σH/µH: 81.6%
µdB + σH/µH: 84.0%
µdB + σH/µH + µH: 83.6%
[Figure: ROC curves, Sensitivity vs. 1 - Specificity (0 to 0.04), for the same five feature sets]
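The BIC test can be sketched for a single candidate boundary as follows. This is a minimal illustration that assumes diagonal-covariance Gaussian models (so splitting one model into two adds 2d parameters); the model family used in the actual experiments is not stated on the slide, and the function names are hypothetical.

```python
import numpy as np

def gauss_loglik(x):
    """Maximized log-likelihood of x under a diagonal Gaussian fit to x."""
    n, d = x.shape
    var = x.var(axis=0) + 1e-9
    return -0.5 * n * (d * np.log(2 * np.pi) + np.sum(np.log(var)) + d)

def delta_bic(x, t, lam=1.0):
    """Log-likelihood ratio for splitting x at frame t, minus the BIC penalty.
    Positive values favor a change point at t."""
    n, d = x.shape
    llr = gauss_loglik(x[:t]) + gauss_loglik(x[t:]) - gauss_loglik(x)
    penalty = 0.5 * lam * np.log(n) * (2 * d)  # 2d extra parameters
    return llr - penalty

def best_change_point(x, margin=10, lam=1.0):
    """Scan candidate boundaries, keeping `margin` frames on each side."""
    candidates = range(margin, len(x) - margin)
    scores = [delta_bic(x, t, lam) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Toy feature sequence: the mean jumps after frame 100.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])
t, score = best_change_point(x)
```

Raising `lam` makes the test more conservative, which is presumably how an operating point such as False Accept = 2% would be selected.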
Segment clustering
• Daily activity has lots of repetition:
Automatically cluster similar segments
[Figure: segment-by-segment affinity matrix (scale 0 to 1) with class labels supermkt, meeting, karaoke, barber, lecture2, billiard, break, lecture1, car/taxi, home, bowling, street, restaurant, library, campus]
• Spectral clustering achieves ~60% correct
16-way ground truth labels
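Spectral clustering of an affinity matrix can be sketched with NumPy alone. The variant below (normalized affinity, top-k eigenvectors, then a small k-means with farthest-point initialization) is one common formulation and is an assumption here; the slide does not say which variant or affinity measure was used.

```python
import numpy as np

def spectral_clusters(affinity, k, n_iter=50):
    """Cluster items given a symmetric, non-negative affinity matrix."""
    deg = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    norm_aff = affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(norm_aff)            # eigenvalues in ascending order
    emb = vecs[:, -k:]                            # top-k eigenvectors
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    # Deterministic farthest-point initialization, then plain k-means.
    centers = [emb[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(d2, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = emb[labels == j].mean(axis=0)
    return labels

# Toy affinity: segments 0-2 resemble each other, as do segments 3-5.
affinity = np.full((6, 6), 0.05)
affinity[:3, :3] = 0.9
affinity[3:, 3:] = 0.9
np.fill_diagonal(affinity, 1.0)
labels = spectral_clusters(affinity, k=2)
```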
Future Work
• Visualization / browsing / diary inference
link to other information sources
• Privacy protection
speaker/speech “search and destroy”
Summary
• Today’s topics:
[Figure: map of LabROSA projects. Areas: Information Extraction, Machine Learning, Signal Processing. Domains: Music, Environment, Speech. Projects: Eigenrhythms, Anchor space, Personal audio, Transform model, FDLP.]
• + Speech recognition, Meeting recordings
LabROSA Summary
• LabROSA
signal processing
+ machine learning
+ info extraction
• Applications
Eigenrhythms: drum pattern models
FDLP temporal envelope models
Music Similarity Browsing
Transformation-based generative models
Personal audio analysis
• Also...
speech recognition, meeting recordings, ...