Segmenting and Classifying Long-Duration Recordings of “Personal Audio” Dan Ellis and Keansub Lee

advertisement
Segmenting and Classifying
Long-Duration Recordings of
“Personal Audio”
Dan Ellis and Keansub Lee
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,kslee}@ee.columbia.edu
1.
2.
3.
4.
5.
“Personal Audio”
Features
Segmentation
Clustering
Future Work
Segmenting Personal Audio - Ellis & Lee
2004-10-03
1. Personal Audio
• Easy to record everything you hear
<2GB / week
@ 64 kbps
hard to
• Very
find anything
how to scan?
how to visualize?
how to index?
• Need automatic analysis
Segmenting Personal Audio - Ellis & Lee
2004-10-03
Applications
• Automatic appointment-book history
fills in when & where of movements
• “Life statistics”
how long did I spend in meetings this week vs. last
most frequent conversations
favorite phrases??
• Retrieving details
what exactly did I promise?
privacy issues...
• Nostalgia?
Segmenting Personal Audio - Ellis & Lee
2004-10-03
Data Set
• Starting point: Collect data
62 hours recorded (8 days, ~7.5 hr/day)
hand-mark 139 segments (26 min/seg avg.)
assign to 16 classes (11 have multiple instances)
Label
Library
Campus
Restaurant
Bowling
Lecture 1
Car/Taxi
Street
total mins
981
750
560
244
234
165
162
Segmenting Personal Audio - Ellis & Lee
total segs
27
56
5
2
4
7
16
2004-10-03
2. Features
duration recordings
• Long
may benefit from longer basic time-frames
60s rather than 10ms?
• Perceptually-motivated features
broad spectrum + some detail?
• For diary application...
background more important than foreground?
smooth out uncharacteristic transients
Segmenting Personal Audio - Ellis & Lee
2004-10-03
Feature sets
Normalized Energy Deviation
Average Linear Energy
120
15
100
10
80
15
40
10
20
5
5
dB
Average Log Energy
60
dB
Log Energy Deviation
120
15
100
10
80
20
freq / bark
20
freq / bark
60
20
freq / bark
freq / bark
20
5
15
15
10
10
5
5
60
dB
dB
Spectral Entropy Deviation
Average Spectral Entropy
0.9
0.8
15
0.7
10
0.6
5
0.5
bits
20
freq / bark
freq / bark
20
0.5
15
0.4
10
0.3
0.2
5
0.1
50
100
150
200
250
300
350
400
• Capture both average and variation
• Capture a little more detail in subbands...
Segmenting Personal Audio - Ellis & Lee
450
time / min
2004-10-03
bits
Spectral Entropy
NF
• Auditory spectrum: A[n, j] = ! w X[n, k]
•• Spectral entropy ≈ ‘peakiness’ of each band:
!
"
jk
k=0
NF
energy / dB
w jkX[n, k]
w jkX[n, k]
H[n, j] = − !
· log
A[n, j]
k=0 A[n, j]
FFT spectral magnitude
0
-20
Auditory Spectrum
-40
-60
rel. entropy / bits
0
1000
2000
3000
4000
5000
6000
7000
8000
0.5
0
per-band
Spectral Entropies
-0.5
-1
30 340 750 1130 1630
2280
3220 3780
Segmenting Personal Audio - Ellis & Lee
4470
5280
6250
7380
freq / Hz
2004-10-03
3. BIC segmentation
• BIC (Bayesian Information Criterion):
Compare more and less complex models
log
L(X1 ;M1 )L(X2 ;M2 )
L(X;M0 )
• For segmentation:
≷
λ
2
log(N )∆#(M )
Grow context window from current boundary
For each window, test every possible segmentation
When BIC is positive, mark new segment
candidate
boundary
last
segmentation point
current
context limit
0
N
L(X1;M1)
time
L(X2;M2)
L(X;M0)
Segmenting Personal Audio - Ellis & Lee
2004-10-03
BIC Segmentation Example
BIC score
AvgLogAudSpec
2004-09-10-1023_AvgLEnergy
20
15
10
5
boundary
passes BIC
last
seg
point
0
-100
no boundary found
with shorter window
-200
13:30
14:00
14:30
15:00
• No training or stored models
Segmenting Personal Audio - Ellis & Lee
current
window
limit
15:30
16:00
time / hr
2004-10-03
Segmentation Results
• Evaluate: 60hr hand-marked boundaries
different features & combinations
Correct Accept % @ False Accept = 2%:
Correct Accept
80.8%
81.1%
81.6%
84.0%
83.6%
73.6%
0.8
0.7
Sensitivity
Feature
µdB
µH
σH/µH
µdB + σH/µH
µdB + σH/µH + µH
avg. mfcc
0.6
0.5
µdB
µH
!H/µH
µdB + !H/µH
µdB + µH + !H/µH
0.4
0.3
0.2
0
Segmenting Personal Audio - Ellis & Lee
0.005
0.01
0.015
0.02
0.025
1 - Specificity
0.03
2004-10-03
0.035
0.04
4. Segment clustering
activity has lots of repetition:
• Daily
Automatically cluster similar segments
‘affinity’ of segments as KL2 distances
1
supermkt
meeting
karaoke
barber
lecture2
billiard
break
lecture1
car/taxi
home
bowling
street
restaurant
library
campus
0.5
cmp
Segmenting Personal Audio - Ellis & Lee
lib rst str ...
0
2004-10-03
Spectral Clustering
• Eigenanalysis of affinity matrix: A = U•S•V’
SVD components: uk•skk•vk'
Affinity Matrix
900
800
800
600
700
400
600
200
k=1
k=2
k=3
k=4
500
400
800
300
600
200
400
100
200
200
400
600
800
200 400 600 800
200 400 600 800
eigenvectors vk give cluster memberships
• Number of clusters?
Segmenting Personal Audio - Ellis & Lee
2004-10-03
Clustering Results
of automatic segments gives
• Clustering
‘anonymous classes’
BIC criterion to choose number of clusters
make best correspondence to 16 GT clusters
freq / Bark
Hand-marked
boundaries
20
15
10
5
21:00
22:00
23:00
0:00
1:00
2:00
3:00
4:00
Clock time
• Frame-level scoring gives ~70% correct
errors when same ‘place’ has multiple ambiences
clusters formed by strong foregrounds (voices)
Segmenting Personal Audio - Ellis & Lee
2004-10-03
5. Future Work
• Visualization / browsing / diary inference
link in other information sources
Segmenting Personal Audio - Ellis & Lee
2004-10-03
Privacy
• Recording conversations conflicts with
expectations of privacy
critical barrier to progress
• Technical solutions to improve acceptance?
Speaker/speech “search and destroy”
scramble 100ms segs of speech
(preserving longer-term statistics)
high-confidence speaker ID to bypass
Segmenting Personal Audio - Ellis & Lee
2004-10-03
Conclusions
• “Personal Audio” is easy & cheap to collect
but is it any use?
• Boundaries quite easy to spot
moving to a new location
change in activity (talking <> reading)
• Repeated activities can cluster together
.. so user’s labels can propagate
• Still gaining experience with the data
speech is the most interesting part
- .. but very hard to transcribe
speaker ID, privacy, ...
Segmenting Personal Audio - Ellis & Lee
2004-10-03
Download