Searching and Describing Audio Databases
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. Music browsing
2. Meeting recordings
3. ‘Personal Audio’
LabROSA Overview
[Diagram: LabROSA areas — Signal Processing, Machine Learning, Information Extraction; applications in Music (similarity, browsing), Speech, and Environment (audio lifelog, meeting recordings)]
Sound Description
• Challenge: Indexing continuous content
what are the equivalents of words and phrases?
• Perception:
sounds as discrete events
• Goal: Duplicate perceptual representation
1. Music Signal Analysis
• A lot of music data available
e.g. 60G of MP3
≈ 1000 hr of audio/15k tracks
• What can we do with it?
identify implicit structure...
• Quality vs. quantity
Speech recognition lesson:
10x data, 1/10th annotation, twice as useful
• Motivating Applications
music search / browsing by similarity
insight into music
Musical Information Extraction
[Diagram: musical information extraction components — Music audio; Melody extraction; Drums extraction; Event extraction; Fragment clustering; Eigenrhythms; Anchor models / Semantic bases; Similarity/recommendation; Synthesis/generation; ?]
• Lots of data + noisy transcription + weak clustering → musical insights?
Transcription as Classification
with Graham Poliner
• Signal models typically used for transcription
harmonic spectrum, superposition
• But ... trade domain knowledge for data
transcription as pure classification problem:
[Diagram: Audio → Trained classifier → p(“C0”|Audio), p(“C#0”|Audio), p(“D0”|Audio), p(“D#0”|Audio), p(“E0”|Audio), p(“F0”|Audio), …]
single N-way discrimination for “melody”
per-note classifiers for polyphonic transcription
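A minimal sketch of this framing, assuming per-frame spectral features and binary note labels from aligned MIDI; scikit-learn's SVC stands in for the Weka SMO SVM used in the talk, and all function names are illustrative:

import numpy as np
from sklearn.svm import SVC

def train_note_classifiers(frames, note_labels):
    # frames: (n_frames, n_bins) spectral features
    # note_labels: (n_frames, n_notes) binary piano roll aligned to the audio
    classifiers = {}
    for note in range(note_labels.shape[1]):
        y = note_labels[:, note]
        if 0 < y.sum() < len(y):                    # need both "on" and "off" examples
            clf = SVC(kernel='linear', probability=True)
            clf.fit(frames, y)
            classifiers[note] = clf
    return classifiers

def transcribe(frames, classifiers, threshold=0.5):
    # independent per-note posteriors, thresholded into a piano-roll estimate
    roll = np.zeros((frames.shape[0], max(classifiers) + 1))
    for note, clf in classifiers.items():
        roll[:, note] = clf.predict_proba(frames)[:, 1] > threshold
    return roll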
Classifier Transcription Results
• Trained on MIDI syntheses (32 songs)
SMO SVM (Weka)
• Tested on MIREX set
Frame-level pitch concordance:

  system     fg+bg    just fg
  “jazz3”    71.5%    56.1%
  overall    44.3%    45.4%

foreground/bg separation
Chord Transcription
with Alex Sheh
• Basic problem:
Recover chord sequence labels from audio
Easier than note transcription?
More relevant to listener perception?
EM HMM Re-Estimation
• Estimate ‘soft’ labels using current models
• Update model parameters from new labels
• Repeat until convergence to local maximum
[Diagram: model inventory (ae1, ae2, ae3, dh1, dh2); labelled training data (“dh ax k ae t”, “s ae t aa n”); uniform initialization of alignments and parameters Θ_init; repeat until convergence]
E-step: probabilities of unknowns   p(q_n^i | X_1^N, Θ_old)
M-step: maximize via parameters     Θ ← argmax_Θ E[log p(X, Q | Θ)]
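As an illustration of this E-step / M-step loop (not the system used here), a Gaussian-emission HMM can be re-estimated with hmmlearn, whose fit() runs exactly this kind of Baum-Welch iteration; the feature array below is a random stand-in:

import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))   # stand-in for per-frame features (e.g. 12-dim PCPs)

model = hmm.GaussianHMM(n_components=5, covariance_type='diag', n_iter=20)
model.fit(X)                # Baum-Welch: E-step state posteriors, M-step parameter updates
states = model.predict(X)   # Viterbi path, analogous to forced alignment of labels to frames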
Chord Sequence Data Sources
• All we need are the chord sequences
for our training examples
Hal Leonard “Paperback Song Series”
- manually retyped for 20 songs:
“Beatles for Sale”, “Help”, “Hard Day’s Night”
[Spectrogram: The Beatles - “A Hard Day's Night”, freq / kHz vs. time / sec]
# The Beatles - A Hard Day's Night
#
G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G
G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G
Bm Em Bm G Em C D G Cadd9 G F6 G Cadd9 G
F6 G C D G C9 G D
G C7 G F6 G C7 G F6 G C D G C9 G Bm Em Bm
G Em C D
G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G
C9 G Cadd9 Fadd9
- hand-align chords for 2 test examples
Chord Results
• Recognition weak, but forced-alignment OK
Frame-level Accuracy (random ~3%)

  Feature    Recognition   Alignment
  MFCC        8.7%          22.0%
  PCP_ROT    21.7%          76.0%

MFCCs are poor (can overtrain)
PCPs better (ROT helps generalization)

[Figure: chromagram of “Beatles - Beatles For Sale - Eight Days a Week” (4096pt), pitch class vs. time / sec, with true / aligned / recognized chord label tracks (G, D, E, Bm, Em7, Am, ...)]
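The PCP features above are per-frame pitch-class profiles; a rough present-day stand-in is a 12-bin chroma vector, sketched here with librosa (the filename is hypothetical and the parameters are illustrative, though the figure caption suggests 4096-point frames):

import librosa

y, sr = librosa.load('hard_days_night.wav')   # hypothetical audio file
pcp = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=4096, hop_length=1024)
print(pcp.shape)                              # (12 pitch classes, n_frames)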
Eigenrhythms: Drum Pattern Space
with John Arroyo
• Pop songs built on repeating “drum loop”
bass drum, snare, hi-hat
small variations on a few basic patterns
• Eigen-analysis (PCA) to capture variations?
by analyzing lots of (MIDI) data (see the sketch after this list)
• Applications
music categorization
“beat box” synthesis
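A minimal sketch of the eigen-analysis step, assuming each (MIDI-derived) drum loop has already been tempo-normalized and flattened to a fixed-length vector (the next slide cites 100 patterns in 1200 dimensions); the random data is a placeholder:

import numpy as np
from sklearn.decomposition import PCA

patterns = np.random.rand(100, 1200)    # stand-in: 100 loops x (3 drums * 400 time steps)
pca = PCA(n_components=20)              # ~20 components needed for good coverage
weights = pca.fit_transform(patterns)   # each loop described by 20 eigenrhythm weights
eigenrhythms = pca.components_.reshape(20, 3, 400)   # basis patterns back in drum x time form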
Eigenrhythms
• Need 20+ Eigenvectors for good coverage
of 100 training patterns (1200 dims)
• Top patterns:
Eigenrhythms for Classification
• Projections in Eigenspace / LDA space
PCA(1,2) projection (16% corr)
LDA(1,2) projection (33% corr)
[Scatter plots of 10 genres — blues, country, disco, hiphop, house, newwave, pop, punk, rnb, rock — in the two projections]
• 10-way Genre classification (nearest nbr):
PCA3: 20% correct
LDA4: 36% correct
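The genre experiment above can be sketched as a projection followed by nearest-neighbour classification; LDA to 4 dimensions plus 1-NN mirrors the “LDA4” condition, but the data, labels, and evaluation here are placeholders:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X = np.random.rand(100, 1200)           # stand-in drum-pattern vectors
y = np.repeat(np.arange(10), 10)        # 10 genre labels, 10 patterns each

clf = make_pipeline(LinearDiscriminantAnalysis(n_components=4),   # "LDA4"
                    KNeighborsClassifier(n_neighbors=1))          # nearest neighbour
clf.fit(X, y)
print(clf.score(X, y))   # resubstitution accuracy; a real test would hold out songs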
Music Similarity Browsing
with Adam Berenzweig
• Musical information overload
record companies filter/categorize music
an automatic system would be less odious
• Connecting audio and preference
map to a ‘semantic space’?
[Diagram: Audio Input (Class i, Class j) → GMM Modeling → conversion to Anchor Space: per-frame posteriors p(a1|x), p(a2|x), …, p(an|x) form an n-dimensional vector in “Anchor Space” → Similarity Computation (KL-d, EMD, etc.)]
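A sketch of that pipeline under simple assumptions: scikit-learn GaussianMixture models for the anchor classes, and a single Gaussian per track for the distance step; anchor classes, dimensions, and function names are illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_anchor_models(anchor_feature_sets, n_mix=8):
    # one GMM per anchor class, trained on that class's frame features
    return [GaussianMixture(n_components=n_mix).fit(f) for f in anchor_feature_sets]

def to_anchor_space(frames, anchor_models):
    # per-frame posterior over the anchor classes (assuming equal priors)
    loglik = np.stack([m.score_samples(frames) for m in anchor_models], axis=1)
    loglik -= loglik.max(axis=1, keepdims=True)
    post = np.exp(loglik)
    return post / post.sum(axis=1, keepdims=True)

def track_similarity(a, b):
    # symmetrized KL between single-Gaussian fits to two tracks' anchor vectors
    # (regularized, since posteriors summing to 1 make the covariance near-singular)
    def stats(x):
        return x.mean(0), np.cov(x.T) + 1e-4 * np.eye(x.shape[1])
    def kl(m0, c0, m1, c1):
        ic1 = np.linalg.inv(c1)
        return 0.5 * (np.trace(ic1 @ c0) + (m1 - m0) @ ic1 @ (m1 - m0)
                      - m0.size + np.log(np.linalg.det(c1) / np.linalg.det(c0)))
    (ma, ca), (mb, cb) = stats(a), stats(b)
    return kl(ma, ca, mb, cb) + kl(mb, cb, ma, ca)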
Anchor Space
• Frame-by-frame high-level categorizations
compare to raw features?
[Scatter plots: Anchor Space Features (Country vs. Electronica posteriors) and Cepstral Features (third vs. fifth cepstral coef) for madonna and bowie frames]
properties in distributions? dynamics?
‘Playola’ Similarity Browser
2. Meeting Recordings
with Jerry Liu and ICSI
• Novel application for speech processing
summarize & review content of meetings
• What really happens in meetings?
analysis of decision making, participant roles
• Multi-mic recordings for speaker turns
e.g. ad-hoc sensor setups (multiple PDAs)
Speaker Turns from Timing Diffs
• Find best timing skew between mic pairs
• Find clusters in high-confidence points
• Fit Gaussians to each cluster,
assign that class to all frames within radius
[Scatter plots of inter-mic timing estimates: “ICSI0: good points”, “All pts: nearest class”, “All pts: closest dimension”]
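The timing skew behind these plots is essentially the lag of the cross-correlation peak between a microphone pair; a basic sketch (without PHAT weighting or the confidence gating used above) might look like:

import numpy as np

def timing_skew(x1, x2, sr, max_lag_ms=50):
    # lag (ms) maximizing the cross-correlation of two mic signals, within +/- max_lag_ms
    max_lag = int(sr * max_lag_ms / 1000)
    xc = np.correlate(x1, x2, mode='full')
    lags = np.arange(-len(x2) + 1, len(x1))
    keep = np.abs(lags) <= max_lag
    return 1000.0 * lags[keep][np.argmax(xc[keep])] / sr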
Browsing Meeting Recordings
• Information in patterns of speaker turns
3. “Personal Audio”
with Keansub Lee
• Easy to record everything you hear
~100GB / year @ 64 kbps
• Very hard to find anything
how to scan?
how to visualize?
how to index?
• Starting point: Collect data
~ 60 hours (8 days, ~7.5 hr/day)
hand-mark 139 segments (26 min/seg avg.)
assign to 16 classes (8 have multiple instances)
Applications
• Automatic appointment-book history
fills in when & where of movements
• “Life statistics”
how long did I spend in meetings this week vs. last
most frequent conversations
favorite phrases??
• Retrieving details
what exactly did I promise?
privacy issues...
• Nostalgia?
Features for Long Recordings
• Feature frames = 1 min (not 25 ms!)
• Characterize variation within each frame...
[Figure: per-minute feature panels over ~450 minutes (time / min), each vs. freq / bark: Average Linear Energy, Normalized Energy Deviation, Average Log Energy, Log Energy Deviation (dB); Average Spectral Entropy, Spectral Entropy Deviation (bits)]
• ... and structure within coarse auditory bands
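A sketch of per-minute features in this spirit: a 20-band mel filterbank stands in for the Bark bands, and the mean and deviation of per-band log energy plus an overall spectral entropy are pooled over each minute (the original statistics are computed per Bark band and differ in detail):

import numpy as np
import librosa

def minute_features(y, sr):
    # short-time band energies, ~4 frames per second
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=20, hop_length=sr // 4)
    logS = np.log(S + 1e-10)
    p = S / (S.sum(axis=0, keepdims=True) + 1e-10)
    entropy = -(p * np.log2(p + 1e-10)).sum(axis=0)   # spectral entropy per short frame
    per_min = 60 * 4                                  # short frames in one minute
    feats = []
    for m in range(logS.shape[1] // per_min):
        seg = slice(m * per_min, (m + 1) * per_min)
        feats.append(np.concatenate([logS[:, seg].mean(axis=1),   # average log energy per band
                                     logS[:, seg].std(axis=1),    # log-energy deviation per band
                                     [entropy[seg].mean(), entropy[seg].std()]]))
    return np.array(feats)                            # one feature row per minute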
Segment clustering
• Daily activity has lots of repetition:
Automatically cluster similar segments
‘affinity’ of segments as KL2 distances
[Figure: segment ‘affinity’ matrix (KL2 distances mapped to 0–1) over labelled segments: supermkt, meeting, karaoke, barber, lecture2, billiard, break, lecture1, car/taxi, home, bowling, street, restaurant, library, campus]
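The affinity above is built from KL2 (symmetrized KL) distances between segments; a sketch with each segment summarized by a diagonal Gaussian over its per-minute features (the Gaussian summary and the exp(-d/sigma) mapping are assumptions):

import numpy as np

def kl2_diag(mu1, var1, mu2, var2):
    # symmetrized KL divergence between two diagonal Gaussians
    kl12 = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)
    kl21 = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu2 - mu1) ** 2) / var1 - 1)
    return kl12 + kl21

def affinity_matrix(segments, sigma=1.0):
    # segments: list of (n_minutes, n_dims) feature arrays -> affinity exp(-KL2/sigma)
    stats = [(s.mean(0), s.var(0) + 1e-6) for s in segments]
    n = len(stats)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            A[i, j] = np.exp(-kl2_diag(*stats[i], *stats[j]) / sigma)
    return A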
Spectral Clustering
• Eigenanalysis of affinity matrix: A = U·S·V'
SVD components: u_k·s_kk·v_k'
[Figure: affinity matrix and its first four SVD components, k = 1…4]
eigenvectors v_k give cluster memberships
• Number of clusters?
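A sketch of that eigenanalysis step: take the SVD of the affinity matrix and cluster segments on the leading singular vectors (the k-means assignment here is an assumption, not necessarily the original procedure):

import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(A, n_clusters):
    # A = U S V': the leading right singular vectors span the cluster structure
    U, s, Vt = np.linalg.svd(A)
    coords = Vt[:n_clusters].T          # each segment's coordinates in the top-k vectors
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(coords)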
Clustering Results
• Clustering of automatic segments gives
‘anonymous classes’
BIC criterion to choose number of clusters (sketched below)
make best correspondence to 16 GT clusters
• Frame-level scoring gives ~70% correct
errors when same ‘place’ has multiple ambiences
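One way to realize the BIC model-selection step, sketched with scikit-learn GMMs over the segment features; the original BIC formulation may differ:

import numpy as np
from sklearn.mixture import GaussianMixture

def pick_n_clusters(X, k_max=30):
    # fit GMMs with increasing component counts and keep the lowest BIC
    # (k_max must not exceed the number of segments in X)
    bics = [GaussianMixture(n_components=k, covariance_type='diag').fit(X).bic(X)
            for k in range(1, k_max + 1)]
    return int(np.argmin(bics)) + 1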
Privacy
• Recordings affront expectations of privacy
critical barrier to progress
• Technical solutions? Speech Scrambling
scramble 200ms segs of speech (long term OK)
high-confidence speaker ID to bypass
[Figure: spectrograms (freq / kHz vs. time / s, level / dB): Original (dan+kean-ex.wav) and Scrambled (200ms wins over 1s)]
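A sketch of the scrambling idea: permute 200 ms chunks within each 1 s block, which preserves the long-term spectral texture while breaking intelligibility (crossfades and the speaker-ID bypass are omitted):

import numpy as np

def scramble(y, sr, chunk_s=0.2, block_s=1.0, seed=0):
    # permute 200 ms chunks inside each 1 s block; long-term statistics survive
    rng = np.random.default_rng(seed)
    chunk, block = int(chunk_s * sr), int(block_s * sr)
    out = y.copy()
    for start in range(0, len(y) - block + 1, block):
        pieces = [out[start + i:start + i + chunk] for i in range(0, block, chunk)]
        rng.shuffle(pieces)
        out[start:start + block] = np.concatenate(pieces)
    return out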
Visualization and Browsing
• Visualization / browsing / diary inference
link in other information sources
- diary
- email
• NoteTaker interface:
“what was I hearing?”
Personal Audio: Current Work
• Voice segment detection / identification
poor and variable SNR
channel variation
• Novelty detection
segments that don’t cluster against archive
• Spatial information
identifying relative motion of head/sources
.. for motion detection
.. for source separation
LabROSA Summary
• LabROSA
signal processing
+ machine learning
+ information extraction
• Applications
Music: Transcription, Recommendation
Speech: Recognition, Organization
Environment: Detection, Description
• Also...
signal separation, compression, dolphins...