Audio Information Extraction
Dan Ellis
<dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/
Outline
1 Audio Information Extraction
2 Speech, music, and other
3 General sound organization
4 Future work & summary
1 Audio Information Extraction (AIE)
[Figure: spectrogram (freq/Hz vs. time/s) of a continuous sound mixture, analyzed into labeled events: voice (evil), voice (pleasant), stab, rumble, choir, strings]
• Central operation:
- continuous sound mixture → distinct objects & events
• Perceptual impression is very strong
- but hard to ‘see’ in signal
Perceptual organization: Bregman’s lake
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
• Received waveform is a mixture
- two sensors, N signals ...
• Disentangling mixtures as primary goal
- perfect solution is not possible
- need knowledge-based constraints
The information in sound
[Figure: two spectrogram panels, ‘Steps 1’ and ‘Steps 2’ (freq/Hz vs. time/s)]
• A sense of hearing is evolutionarily useful
- gives organisms ‘relevant’ information
• Auditory perception is ecologically grounded
- scene analysis is preconscious (→ illusions)
- special-purpose processing reflects ‘natural scene’ properties
- subjective not canonical (ambiguity)
Positioning AIE
[Diagram: AIE positioned among neighboring fields: Automatic Speech Recognition (ASR), Text Information Retrieval (IR), Spoken Document Retrieval (SDR), Text Information Extraction (IE), Blind Source Separation (BSS), Independent Component Analysis (ICA), Music Information Retrieval (Music IR), Computational Auditory Scene Analysis (CASA), Audio Fingerprinting, Audio Content-based Retrieval (Audio CBR), Unsupervised Clustering, Multimedia Information Retrieval (MMIR)]
• Domain
- text ... speech ... music ... general audio
• Operation
- recognize ... index/retrieve ... organize
AIE Applications
• Multimedia access
- sound as complementary dimension
- need all modalities for complete information
• Personal audio
- continuous sound capture quite practical
- different kind of indexing problem
• Machine perception
- intelligence requires awareness
- necessary for communication
• Music retrieval
- area of hot activity
- specific economic factors
Outline
1 Audio Information Extraction
2 Speech, music, and other
- Speech recognition
- Multi-speaker processing
- Music classification
- Other sounds
3 General sound organization
4 Future work & summary
Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
[Diagram: sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters; word models, e.g. s-ah-t) → phone probabilities → HMM decoder (language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → understanding/application...]
• ‘State of the art’ word-error rates (WERs):
- 2% (dictation) to 30% (telephone conversations)
• Can use multiple streams...
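A minimal sketch of the ‘feature calculation’ stage, assuming librosa is available and standard ASR framing (25 ms windows every 10 ms); the filename is a placeholder and this is illustrative rather than the front end actually used here:

```python
# Illustrative MFCC front end (assumptions: librosa available, 16 kHz audio,
# hypothetical filename); produces the per-frame feature vectors that the
# acoustic classifier would consume.
import librosa

def mfcc_features(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                 n_fft=400, hop_length=160)  # 25 ms / 10 ms
    return feats.T  # shape (n_frames, n_mfcc)

frames = mfcc_features("utterance.wav")
```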
Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)
• Neural net estimates phone posteriors, but Gaussian mixtures model finer detail
• Combine them!
[Diagram: three configurations.
Hybrid connectionist-HMM ASR: input sound → feature calculation → speech features → neural net classifier → phone probabilities → Noway decoder → words.
Conventional ASR (HTK): input sound → feature calculation → speech features → Gauss mix models → subword likelihoods → HTK decoder → words.
Tandem modeling: input sound → feature calculation → speech features → neural net classifier → phone probabilities → Gauss mix models → subword likelihoods → HTK decoder → words.]
• Train net, then train GMM on net output
- GMM is ignorant of net output ‘meaning’
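A minimal sketch of the tandem idea, not the original OGI/ICSI/Columbia system: a net is trained to give phone posteriors, the log posteriors are decorrelated, and a Gaussian mixture is trained on the result as anonymous feature vectors. Array shapes, class counts and the sklearn models are assumptions:

```python
# Tandem-style sketch: net posteriors -> log + decorrelation -> GMM features.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

X_train = np.random.randn(5000, 39)        # stand-in for speech feature frames
y_train = np.random.randint(0, 40, 5000)   # stand-in for phone labels

net = MLPClassifier(hidden_layer_sizes=(500,), max_iter=50).fit(X_train, y_train)
post = net.predict_proba(X_train)                                 # phone posteriors
feats = PCA(n_components=24).fit_transform(np.log(post + 1e-6))   # log + decorrelate

# The GMM sees feats as anonymous vectors: it is "ignorant of their meaning".
gmm = GaussianMixture(n_components=16, covariance_type='diag').fit(feats)
```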
Tandem system results: Aurora ‘noisy digits’
(with Manuel Reyes)
[Figure: Aurora 2 Eurospeech 2001 evaluation: average relative improvement (%) on the multicondition task by site: Columbia, Philips, UPC Barcelona, Bell Labs, IBM, Motorola 1, Motorola 2, Nijmegen, ICSI/OGI/Qualcomm, ATR/Griffith, AT&T, Alcatel, Siemens, UCLA, Microsoft, Slovenia, Granada]
• 50% of word errors corrected over baseline
• Beat even ‘bells and whistles’ system using intensive large-vocabulary techniques
Missing data recognition
(Cooke, Green, Barker... @ Sheffield)
• Energy overlaps in time-freq. hide features
- some observations are effectively missing
• Use missing feature theory...
- integrate over missing-data dimensions x_m
p(x | q) = ∫ p(x_p | x_m, q) p(x_m | q) dx_m
• Effective in speech recognition
- trick is finding good/bad data mask
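For the special case of a diagonal-covariance Gaussian and a hard present/missing split, the integral above reduces to evaluating the Gaussian over the present dimensions only; a minimal sketch under that assumption (arrays and mask are hypothetical, and the bounded "soft" variant is more involved):

```python
# Missing-feature log-likelihood for a diagonal Gaussian: marginalizing out
# the missing dimensions x_m leaves the product over present dimensions x_p.
import numpy as np
from scipy.stats import norm

def missing_data_loglik(x, mask, mean, var):
    """x: feature vector; mask: True where the data is judged reliable."""
    p = mask                                   # present dimensions x_p
    return norm.logpdf(x[p], loc=mean[p], scale=np.sqrt(var[p])).sum()

x    = np.array([1.2, 0.3, 5.0, -0.7])        # observed (partly corrupted) frame
mask = np.array([True, True, False, True])    # third dimension judged unreliable
loglik = missing_data_loglik(x, mask, mean=np.zeros(4), var=np.ones(4))
```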
"1754" + noise
90
AURORA 2000 - Test Set A
80
Missing-data
Recognition
WER
70
MD Discrete SNR
MD Soft SNR
HTK clean training
HTK multi-condition
60
50
40
30
"1754"
20
10
Mask based on
stationary noise estimate
AIE @ MERL - Dan Ellis
0
-5
0
5
10
SNR (dB)
15
2002-01-07 - 11
20
Clean
Lab
ROSA
The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, ...):
- 100 hours collected, ongoing transcription
- headsets + tabletop + ‘PDA’
Crosstalk cancellation
• Baseline speaker activity detection is hard:
[Figure: channel energy (level/dB vs. time/secs) for Spkr A-E and the tabletop mic, meeting mr-2000-06-30-1600, annotated with speaker-active regions, interruptions, floor seizures, a speaker ceding the floor, backchannels (signaling a desire to regain the floor?), breath noise, and crosstalk]
• Noisy crosstalk model: m = C ⋅ s + n
• Estimate subband coupling C_Aa from A’s peak energy
- ... including pure delay (10 ms frames)
- ... then linear inversion
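A sketch of the crosstalk-removal idea under stated assumptions: a complex per-bin gain estimated by least squares stands in for the subband coupling C_Aa, and the predicted crosstalk is subtracted; the actual processing also estimates the pure delay over 10 ms frames.

```python
# Per-subband crosstalk estimate and removal (illustrative; signals are stand-ins).
import numpy as np
from scipy.signal import stft, istft

fs = 16000
a = np.random.randn(fs)                        # stand-in for speaker A's close mic
m = 0.3 * a + 0.05 * np.random.randn(fs)       # stand-in for another mic with crosstalk

_, _, A = stft(a, fs=fs, nperseg=160)          # ~10 ms frames
_, _, M = stft(m, fs=fs, nperseg=160)

# Per-bin least-squares coupling (complex gain) from A into M
C = (M * np.conj(A)).sum(axis=1) / ((np.abs(A) ** 2).sum(axis=1) + 1e-12)
M_clean = M - C[:, None] * A                   # remove predicted crosstalk
_, m_clean = istft(M_clean, fs=fs, nperseg=160)
```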
Speaker localization
(with Wei Hee Huan)
• Tabletop mics form an array; time differences locate speakers
[Figure: scatter of estimated source positions around the tabletop microphone array]
• Ambiguity:
- mic positions not fixed
- geometric symmetry
• Detect speaker activity, overlap
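One common way to get the time differences is generalized cross-correlation with PHAT weighting; a minimal sketch (the method and the signals here are assumptions, not necessarily what was used):

```python
# GCC-PHAT time-difference-of-arrival estimate between two mics.
import numpy as np

def gcc_phat_delay(x, y, fs):
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12                    # PHAT weighting
    cc = np.fft.irfft(R, n)
    # rearrange so lag 0 sits in the middle, then pick the peak
    shift = np.argmax(np.concatenate((cc[-(n // 2):], cc[:n // 2]))) - n // 2
    return shift / fs                          # delay in seconds

fs = 16000
x = np.random.randn(fs)                        # stand-in for mic 1
y = np.roll(x, 8)                              # mic 2: same signal delayed 8 samples
tau = gcc_phat_delay(x, y, fs)                 # magnitude ~ 8/fs; sign depends on convention
```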
Music analysis: Structure recovery
(with Rob Turetsky)
• Structure recovery by similarity matrices (after Foote)
- similarity distance measure?
- segmentation & repetition structure
- interpretation at different scales: notes, phrases, movements
- incorporating musical knowledge: ‘theme similarity’
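A minimal sketch of a Foote-style self-similarity matrix, with MFCC frames and cosine similarity as assumed choices (the file name is hypothetical):

```python
# Self-similarity matrix over MFCC frames; repeats show up as off-diagonal stripes.
import numpy as np
import librosa

y, sr = librosa.load("song.wav")                        # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T    # (n_frames, 13)

norms = np.linalg.norm(mfcc, axis=1, keepdims=True) + 1e-12
unit = mfcc / norms
S = unit @ unit.T    # S[i, j] near 1 where frames i and j are similar
# Checkerboard kernels along the diagonal of S give segmentation boundaries;
# off-diagonal stripes indicate repetition structure.
```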
Music analysis: Lyrics extraction
(with Adam Berenzweig)
• Vocal content is highly salient, useful for retrieval
• Can we find the singing? Use an ASR classifier:
[Figure: spectrograms and phone-class posteriograms (freq/kHz vs. time/sec) for speech (trnset #58), singing (vocals #17 + 10.5 s), and music with no vocals (#1)]
• Frame error rate ~20% for segmentation based on posterior-feature statistics
• Lyric segmentation + transcribed lyrics
→ training data for lyrics ASR...
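A sketch of one plausible posterior-feature statistic for the vocals/no-vocals decision: frames where the phone classifier is confident (low posterior entropy) look speech-like, i.e. sung, while instrumental frames look uncertain. The statistic, smoothing and threshold are assumptions, not the published recipe:

```python
# Frame-level singing detection from phone-posterior entropy (illustrative).
import numpy as np

def frame_entropy(posteriors):
    """posteriors: (n_frames, n_phones), rows sum to 1."""
    p = np.clip(posteriors, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

post = np.random.dirichlet(np.ones(40), size=1000)        # stand-in for net output
H = frame_entropy(post)
H_smooth = np.convolve(H, np.ones(50) / 50, mode="same")   # ~0.5 s smoothing
is_vocal = H_smooth < np.median(H_smooth)                  # hypothetical threshold
```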
Artist similarity
• Train network to discriminate specific artists:
[Figure: artist-class outputs over time for Mouse, Tori, XTC, and Beck (w60o40 stats based on LE plp12, 2001-12-28)]
• Focus on vocal segments for consistency
• ... then clustering for recommendation
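A sketch of the clustering step under assumptions: each track is summarized by its mean artist-posterior vector, tracks are clustered in that space, and recommendations are nearest neighbours of a query track:

```python
# Clustering and nearest-neighbour recommendation over artist-posterior summaries.
import numpy as np
from sklearn.cluster import KMeans

n_tracks, n_artists = 200, 20
track_posteriors = np.random.dirichlet(np.ones(n_artists), size=n_tracks)  # stand-ins

labels = KMeans(n_clusters=10, n_init=10).fit_predict(track_posteriors)

def recommend(query_idx, k=5):
    """Return indices of the k nearest tracks to the query in posterior space."""
    d = np.linalg.norm(track_posteriors - track_posteriors[query_idx], axis=1)
    return np.argsort(d)[1:k + 1]
```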
Alarm sound detection
• Alarm sounds have particular structure
- people ‘know them when they hear them’
• Isolate alarms in sound mixtures
[Figure: three spectrogram panels (freq/Hz vs. time/sec) of an alarm sound being isolated from a mixture]
- representation of energy in time-frequency
- formation of atomic elements
- grouping by common properties (onset &c.)
- classify by attributes...
• Key: recognize despite background
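A crude sketch of the first steps above under assumptions: spectrogram peaks serve as atomic elements, and elements are grouped by coincident onset times; classification by attributes would follow.

```python
# Pick spectral peaks as "atomic elements", then group by common onset (illustrative).
import numpy as np
from scipy.signal import stft, find_peaks

fs = 16000
x = np.random.randn(2 * fs)                        # stand-in for a sound mixture
f, t, X = stft(x, fs=fs, nperseg=512)
mag = np.abs(X)

elements = []                                       # (time_idx, freq_idx) peaks
for ti in range(mag.shape[1]):
    peaks, _ = find_peaks(mag[:, ti], height=mag[:, ti].mean() * 4)
    elements += [(ti, fi) for fi in peaks]

# Group elements whose frames fall in the same 50 ms window: a crude stand-in
# for grouping by common onset.
frame_dt = t[1] - t[0]
groups = {}
for ti, fi in elements:
    groups.setdefault(int(ti * frame_dt / 0.05), []).append((ti, fi))
```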
Sound textures
(with Marios Athineos)
• Textures: large class of sounds
- no clear pitch, onsets, shape
- fire, rain, paper, machines, ...
- ‘bulk’ subjective properties
• Abstract & synthesize by:
- project into low-dimensional parameter space
- learn dynamics within space
- generate endless versions
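A minimal sketch of that recipe under assumptions: PCA of log-spectral frames as the low-dimensional space, a first-order linear model as the learned dynamics, and free-running generation of new trajectories:

```python
# Project, learn dynamics, generate (illustrative; frames are stand-ins).
import numpy as np
from sklearn.decomposition import PCA

frames = np.abs(np.random.randn(2000, 257))        # stand-in for |STFT| frames
pca = PCA(n_components=8)
z = pca.fit_transform(np.log(frames + 1e-6))       # low-dimensional trajectory

# Fit z[t+1] ~ z[t] @ A by least squares, keep residuals as the noise model
A, *_ = np.linalg.lstsq(z[:-1], z[1:], rcond=None)
resid = z[1:] - z[:-1] @ A

# Run the model forward to generate an arbitrarily long new trajectory
out = [z[0]]
for _ in range(5000):
    out.append(out[-1] @ A + resid[np.random.randint(len(resid))])
new_spectra = pca.inverse_transform(np.array(out)) # back to log-spectra for resynthesis
```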
Outline
1 Audio Information Extraction
2 Speech, music, and other
3 General sound organization
- Computational Auditory Scene Analysis
- Audio Information Retrieval
4 Future work & summary
Computational Auditory
Scene Analysis (CASA)
[Diagram: sound mixture → CASA → Object 1 description, Object 2 description, Object 3 description, ...]
• Goal: automatic sound organization;
systems to ‘pick out’ sounds in a mixture
- ... like people do
• E.g. voice against a noisy background
- to improve speech recognition
• Approach:
- psychoacoustics describes grouping ‘rules’
- ... just implement them?
CASA front-end processing
• Correlogram: loosely based on known/possible physiology
[Diagram: sound → cochlea filterbank → frequency channels → envelope follower → delay line → short-time autocorrelation → correlogram slice (freq × lag × time)]
- linear filterbank cochlear approximation
- static nonlinearity
- zero-delay slice is like spectrogram
- periodicity from delay-and-multiply detectors
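A small illustrative correlogram front end, with Butterworth bandpass filters standing in for the cochlear filterbank and half-wave rectification as the static nonlinearity; all parameters are assumptions:

```python
# Filterbank + rectification + per-channel short-time autocorrelation.
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000
x = np.random.randn(fs)                                # stand-in input sound
centers = np.geomspace(100, 4000, 16)                  # channel centre frequencies

def channel(x, fc):
    b, a = butter(2, [0.7 * fc / (fs / 2), 1.3 * fc / (fs / 2)], btype="band")
    return np.maximum(lfilter(b, a, x), 0)             # filter + half-wave rectify

win, max_lag = 400, 200                                 # 25 ms window, 12.5 ms of lags
def short_time_acf(y, start):
    seg = y[start:start + win]
    return np.array([np.dot(seg[:win - l], seg[l:]) for l in range(max_lag)])

# One correlogram slice at a single time: (channels x lags); its zero-lag
# column behaves like a spectrogram frame.
corr_slice = np.array([short_time_acf(channel(x, fc), start=4000) for fc in centers])
```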
Adding top-down cues
Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up)...
[Diagram: input mixture → front end → signal features → object formation (using grouping rules) → discrete objects → source groups]
• vs. Prediction-driven (top-down) (PDCASA)
[Diagram: input mixture → front end → signal features → compare & reconcile; prediction errors drive hypothesis management; hypotheses (periodic components, noise components) → predict & combine → predicted features, fed back to compare & reconcile]
• Motivations
- detect non-tonal events (noise & click elements)
- support ‘restoration illusions’...
→ hooks for high-level knowledge
+ ‘complete explanation’, multiple hypotheses, ...
PDCASA and complex scenes
[Figure: PDCASA analysis of a city-street scene (f/Hz vs. time/s, level in dB): the mixture is decomposed into Wefts 1-4, Weft 5, Wefts 6-7, Weft 8, Wefts 9-12, Noise 1, and Noise 2 + Click 1, labeled Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10), Crash (10/10), Squeal (6/10), Truck (7/10)]
Audio Information Retrieval
(with Manuel Reyes)
• Searching in a database of audio
- speech .. use ASR
- text annotations .. search them
- sound effects library?
• e.g. Muscle Fish “SoundFisher” browser
- define multiple ‘perceptual’ feature dimensions
- search by proximity in (weighted) feature space
[Diagram: sound segment database → segment feature analysis → feature vectors; query example → segment feature analysis → search/comparison → results]
- features are ‘global’ for each soundfile, no attempt to separate mixtures
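A sketch of this style of retrieval under assumptions: one global feature vector per file (MFCC statistics plus spectral centroid and bandwidth), with weighted Euclidean distance for search; the file names are placeholders:

```python
# Global-feature indexing and weighted nearest-neighbour search (illustrative).
import numpy as np
import librosa

def global_features(path):
    y, sr = librosa.load(path)                        # hypothetical file paths
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    return np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1),
                      centroid.mean(), bandwidth.mean()])

db = {p: global_features(p) for p in ["dog.wav", "bell.wav", "rain.wav"]}

def search(query_path, weights=None, k=2):
    q = global_features(query_path)
    w = np.ones_like(q) if weights is None else weights
    dists = {p: np.sqrt(np.sum(w * (v - q) ** 2)) for p, v in db.items()}
    return sorted(dists, key=dists.get)[:k]
```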
Audio Retrieval: Results
• Musclefish corpus
- most commonly reported set
• Features
- mfcc, brightness, bandwidth, pitch ...
- no temporal sequence structure
• Results:
- 208 examples, 16 classes, 84% correct
- confusions:
[Table: confusion matrix over Musical instruments, Speech, Environmental, Animal, and Mechanical classes; correct counts 136 (14), 17 (7), 6 (1), 1 (0), and 15 (2) respectively, with the remaining errors spread as one- and two-example confusions between classes]
CASA for audio retrieval
• When audio material contains mixtures, global features are insufficient
• Retrieval based on element/object analysis:
[Diagram: continuous audio archive → generic element analysis → element representations → object formation → objects + properties; query example → generic element analysis → (object formation); symbolic query, e.g. “rushing water” → word-to-class mapping → properties alone; all compared in a search/comparison stage → results]
- features are calculated over grouped subsets
Outline
1 Audio Information Extraction
2 Speech, music, and other
3 General sound organization
4 Future work & summary
Automatic audio-video analysis
(with Prof. Shih-Fu Chang, Prof. Kathy McKeown)
• Documentary archive management
- huge ratio of raw-to-finished material
- costly manual logging
- missed opportunities for cross-fertilization
• Problem: term <-> signal mapping
- training corpus of past annotations
- interactive semi-automatic learning
- need object-related features
[Diagram: annotations → text processing → terms; A/V data → A/V segmentation and feature extraction → A/V features → unsupervised clustering → generic detectors; terms + A/V features → concept discovery → MM Concept Network → multimedia fusion (mining concepts & relationships), complex spatio-temporal classification, question answering]
The ‘Machine listener’
• Goal: an auditory system for machines
- use same environmental information as people
• Signal understanding
- monitor for particular sounds
- real-time description
• Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots
LabROSA Summary
DOMAINS
• Meetings
• Personal recordings
• Location monitoring
• Broadcast
• Movies
• Lectures
ROSA
• Object-based structure discovery & learning
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Scene analysis
• Audio-visual integration
• Music analysis
APPLICATIONS
• Structuring
• Search
• Summarization
• Awareness
• Understanding
Summary: Audio Info Extraction
• Sound carries information
- useful and detailed
- often tangled in mixtures
• Various important general classes
- Speech: activity, recognition
- Music: segmentation, clustering
- Other: detection, description
• General processing framework
- Computational Auditory Scene Analysis
- Audio Information Retrieval
• Future applications
- Ubiquitous intelligent indexing
- Intelligent monitoring & description
Audio Information Extraction:
panacea or punishment?