Audio Information Extraction ROSA )

advertisement
Audio Information Extraction
Dan Ellis
<dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/
Outline
1
Sound organization
2
Speech, music, and other
3
General sound organization
4
Future work & summary
AIE @ JHU - Dan Ellis
2002-03-28 - 1
Lab
ROSA
Sound Organization
1
4000
frq/Hz
3000
2000
1000
0
0
2
4
6
8
10
12
time/s
Analysis
Voice (evil)
Voice (pleasant)
Stab
Rumble
Choir
Strings
•
Analyzing and describing complex sounds:
- continuous sound mixture
→ distinct objects & events
•
Human listeners as the prototype
- strong subjective impression when listening
- ..but hard to ‘see’ in signal
AIE @ JHU - Dan Ellis
2002-03-28 - 2
Lab
ROSA
Bregman’s lake
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
•
Received waveform is a mixture
- two sensors, N signals ...
•
Disentangling mixtures as primary goal
- perfect solution is not possible
- need knowledge-based constraints
AIE @ JHU - Dan Ellis
2002-03-28 - 3
Lab
ROSA
The information in sound
freq / Hz
Steps 1
Steps 2
4000
3000
2000
1000
0
0
1
2
3
4
0
1
2
3
4
time / s
•
Hearing confers evolutionary advantage
- optimized to get ‘useful’ information from sound
•
Auditory perception is ecologically grounded
- scene analysis is preconscious (→ illusions)
- special-purpose processing reflects
‘natural scene’ properties
- subjective not canonical (ambiguity)
AIE @ JHU - Dan Ellis
2002-03-28 - 4
Lab
ROSA
Audio Information Extraction
AutomaticSpeech
Recognition
ASR
Text
Information
Retrieval
IR
Spoken
Document
Retrieval
SDR
Text
Information
Extraction
IE
Blind Source
Separation
BSS
Independent
Component
Analysis
ICA
Music
Information
Retrieval
Music IR
Audio
Information
Extraction
AIE
Computational
Auditory
Scene Analysis
CASA
Audio
Fingerprinting
Audio
Content-based
Retrieval
Audio CBR
Unsupervised
Clustering
•
Domain
- text ... speech ... music ... general audio
•
Operation
- recognize ... index/retrieve ... organize
AIE @ JHU - Dan Ellis
Multimedia
Information
Retrieval
MMIR
2002-03-28 - 5
Lab
ROSA
AIE Applications
•
Multimedia access
- sound as complementary dimension
- need all modalities for complete information
•
Personal audio
- continuous sound capture quite practical
- different kind of indexing problem
•
Machine perception
- intelligence requires awareness
- necessary for communication
•
Music retrieval
- area of hot activity
- specific economic factors
AIE @ JHU - Dan Ellis
2002-03-28 - 6
Lab
ROSA
Outline
1
Sound organization
2
Speech, music, and other
- Speech recognition
- Multi-speaker processing
- Music classification
- Other sounds
3
General sound organization
4
Future work & summary
AIE @ JHU - Dan Ellis
2002-03-28 - 7
Lab
ROSA
Automatic Speech Recognition (ASR)
•
Standard speech recognition structure:
sound
Feature
calculation
D AT A
feature vectors
Acoustic model
parameters
Acoustic
classifier
Word models
s
ah
phone probabilities
t
HMM
decoder
Language model
p("sat"|"the","cat")
p("saw"|"the","cat")
phone / word
sequence
Understanding/
application...
•
‘State of the art’ word-error rates (WERs):
- 2% (dictation) - 30% (telephone conversations)
•
Can use multiple streams...
AIE @ JHU - Dan Ellis
2002-03-28 - 8
Lab
ROSA
Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)
•
Neural net estimates phone posteriors;
but Gaussian mixtures model finer detail
•
Combine them!
Hybrid Connectionist-HMM ASR
Feature
calculation
Conventional ASR (HTK)
Noway
decoder
Neural net
classifier
Feature
calculation
HTK
decoder
Gauss mix
models
h#
pcl
bcl
tcl
dcl
C0
C1
C2
s
Ck
tn
ah
t
s
ah
t
tn+w
Input
sound
Words
Phone
probabilities
Speech
features
Input
sound
Subword
likelihoods
Speech
features
Tandem modeling
Feature
calculation
Neural net
classifier
HTK
decoder
Gauss mix
models
h#
pcl
bcl
tcl
dcl
C0
C1
C2
s
Ck
tn
ah
t
tn+w
Input
sound
•
Speech
features
Phone
probabilities
Subword
likelihoods
Words
Train net, then train GMM on net output
- GMM is ignorant of net output ‘meaning’
AIE @ JHU - Dan Ellis
2002-03-28 - 9
Lab
ROSA
Words
Tandem system results
•
It works very well (‘Aurora’ noisy digits):
WER as a function of SNR for various Aurora99 systems
100
Average WER ratio to baseline:
HTK GMM:
100%
Hybrid:
84.6%
Tandem:
64.5%
Tandem + PC: 47.2%
WER / % (log scale)
50
20
10
5
HTK GMM baseline
Hybrid connectionist
Tandem
Tandem + PC
2
1
-5
0
System-features
5
10
15
SNR / dB (averaged over 4 noises)
20
clean
Avg. WER 20-0 dB
Baseline WER ratio
HTK-mfcc
13.7%
100%
Neural net-mfcc
9.3%
84.5%
Tandem-mfcc
7.4%
64.5%
Tandem-msg+plp
6.4%
47.2%
AIE @ JHU - Dan Ellis
2002-03-28 - 10
Lab
ROSA
Relative contributions
•
Approx relative impact on baseline WER ratio
for different component:
Combo over msg:
+20%
plp
Neural net
classifier
h#
pcl
bcl
tcl
dcl
C0
C1
C2
Ck
tn
tn+w
Input
sound
Pre-nonlinearity over +
posteriors: +12%
PCA
orthog'n
HTK
decoder
Gauss mix
models
s
msg
Neural net
classifier
h#
pcl
bcl
tcl
dcl
C0
C1
C2
Ck
KLT over
direct:
+8%
ah
Combo-into-HTK over
Combo-into-noway:
+15%
t
Words
tn
Combo over plp:
+20%
tn+w
Combo over mfcc:
+25%
NN over HTK:
+15%
Tandem over HTK:
+35%
Tandem over hybrid:
+25%
Tandem combo over HTK mfcc baseline: +53%
AIE @ JHU - Dan Ellis
2002-03-28 - 11
Lab
ROSA
Inside Tandem systems:
What’s going on?
•
Visualizations of the net outputs
Clean
3
3
2
2
1
1
0
0
10
10
5
5
0
0
q
h#
ax
uw
ow
ao
ah
ay
ey
eh
ih
iy
w
r
n
v
th
f
z
s
kcl
tcl
k
t
q
h#
ax
uw
ow
ao
ah
ay
ey
eh
ih
iy
w
r
n
v
th
f
z
s
kcl
tcl
k
t
q
h#
ax
uw
ow
ao
ah
ay
ey
eh
ih
iy
w
r
n
v
th
f
z
s
kcl
tcl
k
t
q
h#
ax
uw
ow
ao
ah
ay
ey
eh
ih
iy
w
r
n
v
th
f
z
s
kcl
tcl
k
t
freq / kHz
4
phone
Spectrogram
5dB SNR to ‘Hall’ noise
4
phone
“one eight three”
(MFP_183A)
10
0
-10
-20
-30
-40 dB
Cepstral-smoothed
mel spectrum
Phone posterior
estimates
Hidden layer
linear outputs
freq / mel chan
7
5
4
0
•
6
0.2
0.4
0.6
time / s
0.8
1
3
1
0.5
0
20
10
0
-10
0
0.2
0.4
0.6
time / s
0.8
1
-20
Neural net normalizes away noise
AIE @ JHU - Dan Ellis
2002-03-28 - 12
Lab
ROSA
Tandem for large vocabulary recognition
•
CI Tandem front end + CD LVCSR back end
PLP feature
calculation
Neural net
classifier 1
h#
pcl
bcl
tcl
dcl
C0
C1
Tandem
feature
calculation
SPHINX
recognizer
C2
Ck
tn
tn+w
+
Pre-nonlinearity
outputs
Input
sound
MSG feature
calculation
PCA
decorrelation
HMM
decoder
GMM
classifier
Subword
likelihoods
Neural net
classifier 2
C1
ah
t
Words
MLLR
adaptation
h#
pcl
bcl
tcl
dcl
C0
s
C2
Ck
tn
tn+w
•
Tandem benefits reduced:
75
mfc
plp
tandem1
tandem2
70
65
WER / %
60
55
50
45
40
35
30
AIE @ JHU - Dan Ellis
Context
Independent
Context
Dependent
Context Dep.
+MLLR
2002-03-28 - 13
Lab
ROSA
‘Tandem-domain’ processing
(with Manuel Reyes)
•
Can we improve the ‘tandem’ features with
conventional processing (deltas,
normalization)?
Neural net
...
Norm /
deltas
?
h#
pcl
bcl
tcl
dcl
C0
C1
C2
Ck
tn
PCA
decorrelation
Rank
reduce
?
Norm /
deltas
?
GMM
classifier
...
tn+w
•
Somewhat..
Processing
Avg. WER 20-0 dB
Baseline WER ratio
Tandem PLP mismatch baseline (24 els)
11.1%
70.3%
Rank reduce @ 18 els
11.8%
77.1%
Delta → PCA
9.7%
60.8%
PCA → Norm
9.0%
58.8%
Delta → PCA → Norm
8.3%
53.6%
AIE @ JHU - Dan Ellis
2002-03-28 - 14
Lab
ROSA
Tandem vs. other approaches
Aurora 2 Eurospeech 2001 Evaluation
Columbia
Rel improvement % - Multicondition
60
Philips
UPC Barcelona
Bell Labs
50
IBM
40
Motorola 1
Motorola 2
30
Nijmegen
ICSI/OGI/Qualcomm
20
ATR/Griffith
AT&T
Alcatel
10
Siemens
UCLA
Microsoft
0
Avg. rel. improvement
-10
Slovenia
Granada
•
50% of word errors corrected over baseline
•
Beat ‘bells and whistles’ system
using large-vocabulary techniques
AIE @ JHU - Dan Ellis
2002-03-28 - 15
Lab
ROSA
Missing data recognition
(Cooke, Green, Barker @ Sheffield)
•
Energy overlaps in time-freq. hide features
- some observations are effectively missing
•
Use missing feature theory...
- integrate over missing data xm under model M
p( x M ) =
•
∫ p( x p
x m, M ) p ( x m M ) d x m
Effective in speech recognition
- trick is finding good/bad data mask
"1754" + noise
90
AURORA 2000 - Test Set A
80
Missing-data
Recognition
WER
70
MD Discrete SNR
MD Soft SNR
HTK clean training
HTK multi-condition
60
50
40
"1754"
Mask based on
stationary noise estimate
30
20
10
0
-5
AIE @ JHU - Dan Ellis
0
5
10
2002-03-28 - 16
15
Lab
ROSA
20
Clean
Multi-source decoding
(Jon Barker @ Sheffield)
•
Search of sound-fragment interpretations
"1754" + noise
Mask based on
stationary noise estimate
Mask split into subbands
`Grouping' applied within bands:
Spectro-Temporal Proximity
Common Onset/Offset
Multisource
Decoder
"1754"
- also search over data mask K:
p ( M, K x ) ∝ p ( x K , M ) p ( K M ) p ( M )
•
CASA for masks → larger fragments → faster
AIE @ JHU - Dan Ellis
2002-03-28 - 17
Lab
ROSA
The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
•
Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech
•
Data collection (ICSI, UW, ...):
- 100 hours collected, ongoing transcription
mr-2000-11-02
C
A
O
M
J
L
0
5
10
15
AIE @ JHU - Dan Ellis
20
25
30
35
40
45
50
2002-03-28 - 18
55
60
Lab
ROSA
Crosstalk cancellation
(with Sam Keene)
•
Baseline speaker activity detection is hard:
backchannel
(signals desire to regain floor?)
floor seizure
mr-2000-06-30-1600
Spkr A
speaker
active
speaker B
cedes floor
Spkr B
Spkr C
interruptions
Spkr D
breath
noise
Spkr E
crosstalk
level/dB
40
Table
top
120
20
0
125
130
135
140
145
150
155
time / secs
•
Noisy crosstalk model: m = C ⋅ s + n
•
Estimate subband CAa from A’s peak energy
- ... then linear inversion
AIE @ JHU - Dan Ellis
2002-03-28 - 19
Lab
ROSA
Speaker localization
(with Huan Wei Hee)
•
Tabletop mics form an array;
time differences locate speakers
Inferred talker positions (x=mic)
mr-2000-11-02-1440: PZM xcorr lags
1
4
lag 3-4 / ms
0
1
-1
4
2
z/m
0.5
3
-2
-3
-3
0
2
2
1
-2
-1
0
1
lag 1-2 / ms
2
3
1
1
0
y/m
•
Ambiguity:
- mic positions not fixed
- speaker motions
•
Detect speaker activity, overlap
AIE @ JHU - Dan Ellis
2
3
-1
0 x/m
-2
2002-03-28 - 20
-2
Lab
ROSA
freq / Hz
Alarm sound detection
•
Alarm sounds have particular structure
- people ‘know them when they hear them’
•
Isolate alarms in sound mixtures
5000
5000
5000
4000
4000
4000
3000
3000
3000
2000
2000
2000
1000
1000
1000
0
1
1.5
2
2.5
0
1
1.5
2
time / sec
2.5
0
1
1
1.5
2
2.5
freq / Hz
- sinusoid peaks have invariant properties
Speech + alarm
4000
2000
Pr(alarm)
0
1
0.5
0
1
2
3
4
5
6
7
8
9 time / sec
- cepstral coefficients are easy to model
AIE @ JHU - Dan Ellis
2002-03-28 - 21
Lab
ROSA
Music analysis: Structure recovery
(with Rob Turetsky)
•
Structure recovery by similarity matrices
(after Foote)
- similarity distance
measure?
- segmentation &
repetition structure
- interpretation at different
scales:
notes, phrases,
movements
- incorporating musical
knowledge:
‘theme similarity’
AIE @ JHU - Dan Ellis
2002-03-28 - 22
Lab
ROSA
Music analysis: Lyrics extraction
(with Adam Berenzweig)
•
Vocal content is highly salient,
useful for retrieval
•
Can we find the singing?
Use an ASR classifier:
freq / kHz
music (no vocals #1)
singing (vocals #17 + 10.5s)
4
2
0
singing
phone class
Posteriogram Spectrogram
speech (trnset #58)
0
1
2
time / sec
0
1
2
time / sec
0
1
2
•
Frame error rate ~20% for segmentation based
on posterior-feature statistics
•
Lyric segmentation + transcribed lyrics
→ training data for lyrics ASR...
AIE @ JHU - Dan Ellis
2002-03-28 - 23
Lab
ROSA
time / sec
Artist similarity
•
Train network to discriminate specific artists:
Mouse
w60o40 stats based on LE plp12 2001-12-28
50
100
150
0
100
200
300
50
100
150
200
250
0
0
50
100
150
200
250 0
200
400
600
Artist class
0
100
200
300
Tori
XTC
Beck
0
100
200
300
0
100
200
300
400
500
t
0
•
Focus on vocal segments for consistency
- accuracy 57% → 65%
•
.. then clustering for recommendation
AIE @ JHU - Dan Ellis
2002-03-28 - 24
Lab
ROSA
Outline
1
Sound organization
2
Speech, music, and other
3
General sound organization
- Computational Auditory Scene Analysis
- Audio Information Retrieval
4
Future work & summary
AIE @ JHU - Dan Ellis
2002-03-28 - 25
Lab
ROSA
Computational Auditory
Scene Analysis (CASA)
CASA
Object 1 description
Object 2 description
Object 3 description
...
•
Goal: Automatic sound organization ;
Systems to ‘pick out’ sounds in a mixture
- ... like people do
•
E.g. voice against a noisy background
- to improve speech recognition
•
Approach:
- psychoacoustics describes grouping ‘rules’
- ... can we implement them?
AIE @ JHU - Dan Ellis
2002-03-28 - 26
Lab
ROSA
CASA front-end processing
•
Correlogram:
Loosely based on known/possible physiology
de
lay
l
ine
short-time
autocorrelation
sound
frequency channels
Correlogram
slice
Cochlea
filterbank
envelope
follower
freq
lag
time
-
linear filterbank cochlear approximation
static nonlinearity
zero-delay slice is like spectrogram
periodicity from delay-and-multiply detectors
AIE @ JHU - Dan Ellis
2002-03-28 - 27
Lab
ROSA
Adding top-down cues
Perception is not direct
but a search for plausible hypotheses
•
Data-driven (bottom-up)...
input
mixture
Front end
signal
features
Object
formation
discrete
objects
Grouping
rules
Source
groups
vs. Prediction-driven (top-down) (PDCASA)
hypotheses
Noise
components
Hypothesis
management
prediction
errors
input
mixture
Front end
signal
features
Compare
& reconcile
Periodic
components
Predict
& combine
predicted
features
•
Motivations
- detect non-tonal events (noise & click elements)
- support ‘restoration illusions’...
•
Machine Learning for sound models
- corpus of isolated sounds?
AIE @ JHU - Dan Ellis
2002-03-28 - 28
Lab
ROSA
PDCASA and complex scenes
f/Hz
City
4000
2000
1000
400
200
1000
400
200
100
50
0
f/Hz
1
2
3
Wefts1−4
4
5
Weft5
6
7
Wefts6,7
8
Weft8
9
Wefts9−12
4000
2000
1000
400
200
1000
400
200
100
50
Horn1 (10/10)
Horn2 (5/10)
Horn3 (5/10)
Horn4 (8/10)
Horn5 (10/10)
f/Hz
Noise2,Click1
4000
2000
1000
400
200
Crash (10/10)
f/Hz
Noise1
4000
2000
1000
−40
400
200
−50
−60
Squeal (6/10)
Truck (7/10)
−70
0
AIE @ JHU - Dan Ellis
1
2
3
4
5
6
7
8
9
time/s
2002-03-28 - 29
dB
Lab
ROSA
Audio Information Retrieval
(with Manuel Reyes)
•
Searching in a database of audio
- speech .. use ASR
- text annotations .. search them
- sound effects library?
•
e.g. Muscle Fish “SoundFisher” browser
- define multiple ‘perceptual’ feature dimensions
- search by proximity in (weighted) feature space
Segment
feature
analysis
Sound segment
database
Feature vectors
Seach/
comparison
Results
Segment
feature
analysis
Query example
- features are ‘global’ for each soundfile,
no attempt to separate mixtures
AIE @ JHU - Dan Ellis
2002-03-28 - 30
Lab
ROSA
Audio Retrieval: Results
•
Musclefish corpus
- most commonly reported set
•
Features
- mfcc, brightness, bandwidth, pitch ...
- no temporal sequence structure
•
Results:
- 208 examples, 16 classes
Global features: 41% corr
HMM models: 81% corr.
Mu
Musical
Sp
59/
46
11/ 6
Speech
Env
An
Mec
Mu
24
2
19
136/
6
4
5
1
Eviron.
7/ 2
Animals
2
1/
2
4
1
8/ 4
Mechan
1
AIE @ JHU - Dan Ellis
1
Sp
14/ 2
Env
An
Mec
2
1
5
5
3
1
7/
1
4/
3
3
2002-03-28 - 31
1
12/
Lab
ROSA
What are the HMM states?
•
No sub-units defined for nonspeech sounds
•
Final states depend on EM initialization
- labels
- clusters
- transition matrix
•
Have ideas of what we’d like to get
- investigate features/initialization to get there
dogBarks2
freq / kHz
s7
s3
s2 s4
s2 s3
s7 s2
s3s5
s2
s3 s2 s4 s3 s4
s3 s7
s4
8
7
6
5
4
3
2
1
time
1.15
1.20 1.25 1.30 1.35 1.40 1.45 1.50 1.55 1.60 1.65 1.70 1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10 2.15 2.20 2.25 2
AIE @ JHU - Dan Ellis
2002-03-28 - 32
Lab
ROSA
CASA for audio retrieval
•
When audio material contains mixtures,
global features are insufficient
•
Retrieval based on element/object analysis:
Generic
element
analysis
Continuous audio
archive
Object
formation
Objects + properties
Element representations
Generic
element
analysis
Seach/
comparison
Results
(Object
formation)
Query example
rushing water
Symbolic query
Word-to-class
mapping
Properties alone
- features are calculated over grouped subsets
AIE @ JHU - Dan Ellis
2002-03-28 - 33
Lab
ROSA
Outline
1
Sound organization
2
Speech, music, and other
3
General sound organization
4
Future work & summary
AIE @ JHU - Dan Ellis
2002-03-28 - 34
Lab
ROSA
Automatic audio-video analysis
(with Prof. Shih-Fu Chang, Prof. Kathy McKeown)
•
Documentary archive management
- huge ratio of raw-to-finished material
- costly manual logging
- missed opportunities for cross-fertilization
•
Problem: term ↔ signal mapping
- training corpus of past annotations
- interactive semi-automatic learning
- need object-related features
MM Concept Network
annotations
A/V data
text
processing
A/V
segmentation
and
feature
extraction
terms
A/V
features
concept
discovery
A/V feature
unsupervised
clustering
generic
detectors
AIE @ JHU - Dan Ellis
Multimedia
Fusion:
Mining
Concepts &
Relationships
Complex
SpatioTemporal
Classification
Question
Answering
2002-03-28 - 35
Lab
ROSA
The ‘Machine listener’
•
Goal: An auditory system for machines
- use same environmental information as people
•
Signal understanding
- monitor for particular sounds
- real-time description
•
Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots
AIE @ JHU - Dan Ellis
2002-03-28 - 36
Lab
ROSA
DOMAINS
LabROSA Summary
• Meetings
• Personal recordings
• Location monitoring
• Broadcast
• Movies
• Lectures
ROSA
• Object-based structure discovery & learning
APPLICATIONS
• Speech recognition
• Speech characterization
• Nonspeech recognition
•
•
•
•
•
AIE @ JHU - Dan Ellis
• Scene analysis
• Audio-visual integration
• Music analysis
Structuring
Search
Summarization
Awareness
Understanding
2002-03-28 - 37
Lab
ROSA
Summary: Audio Info Extraction
•
Sound carries information
- useful and detailed
- often tangled in mixtures
•
Various important general classes
- Speech: activity, recognition
- Music: segmentation, clustering
- Other: detection, description
•
General processing framework
- Computational Auditory Scene Analysis
- Audio Information Retrieval
•
Future applications
- Ubiquitous intelligent indexing
- Intelligent monitoring & description
AIE @ JHU - Dan Ellis
2002-03-28 - 38
Lab
ROSA
Download