Recognition and Organization of Speech and Audio ROSA

Recognition and Organization of
Speech and Audio
Dan Ellis
<dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/
Outline
1 Audio Organization
2 Speech, music, and other
3 Future work
Dan Ellis
Advent Overview
2002-09-20 - 1
1 Audio Organization
[Figure: spectrogram of a sound mixture; axes: frq/Hz (0-4000) vs. time/s (0-12), level / dB (-60 to 0). Analysis labels: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Analyzing and describing complex sounds:
- continuous sound mixture → distinct objects & events
• Human listeners as the prototype
- strong subjective impression when listening
- ..but hard to ‘see’ in signal
Human Sound Organization
• Sound percepts depend on ‘grouping cues’
- common onset across frequency
- common periodicity (fundamental)
- spatial (binaural) cues, familiarity, ...
[Figure: spectrogram; axes: freq / Hz (0-8000) vs. time / s (0-9)]
• Hearing confers evolutionary advantage
- optimized to get ‘useful’ information from sound
• Auditory perception is ecologically grounded
- scene analysis is preconscious (→ illusions)
2 Speech & Speaker recognition
(Patricia Scanlon)
• Mutual Information identifies where the information is in time/frequency:
- little temporal structure averaged over all sounds
[Figure: mutual information over time/frequency; panel: Phoneme info]
Better with just vowels:
[Figure: mutual information over time/frequency; panels: Phoneme info, Speaker info]
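As a rough illustration of the mutual-information measure on this slide, the sketch below estimates I(F; C) in bits between a quantized feature value F (for example, one time/frequency cell) and a class label C (phoneme or speaker) from co-occurrence counts. The function name and the toy data are assumptions made for illustration; this is not the analysis behind the figures.

```python
import math
from collections import Counter

def mutual_information_bits(features, labels):
    """Estimate I(F; C) in bits from paired discrete observations."""
    n = len(features)
    joint = Counter(zip(features, labels))   # counts of (feature, label) pairs
    feat_counts = Counter(features)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, c), count in joint.items():
        p_fc = count / n
        p_f = feat_counts[f] / n
        p_c = label_counts[c] / n
        mi += p_fc * math.log2(p_fc / (p_f * p_c))
    return mi

# Hypothetical toy data: four frames, two classes.
labels = ["a", "a", "b", "b"]
informative = mutual_information_bits([0, 0, 1, 1], labels)    # feature tracks the class
uninformative = mutual_information_bits([0, 1, 0, 1], labels)  # feature independent of class
print(informative, uninformative)
```

Computing this value separately for each time/frequency cell would give a map of where the class information lies, which is the kind of picture the slide describes.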
Speech Fragment recognition
(Barker & Cooke/Sheffield)
• Standard classification chooses between models M to match source features X
\[ M^{*} = \arg\max_{M} P(M \mid X) = \arg\max_{M} P(X \mid M) \cdot \frac{P(M)}{P(X)} \]
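The decision rule above can be sketched in a few lines. Because P(X) does not depend on M, the argmax only needs log P(X | M) + log P(M). The diagonal-Gaussian likelihood and the two toy models "A" and "B" are assumptions chosen for illustration, not the models used in the work described.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def classify(x, models):
    """Return the model name maximizing log P(X|M) + log P(M)."""
    best_name, best_score = None, -math.inf
    for name, (prior, means, variances) in models.items():
        score = math.log(prior)
        for xi, m, v in zip(x, means, variances):
            score += log_gaussian(xi, m, v)   # independent feature dimensions
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical models: name -> (P(M), per-dimension means, per-dimension variances)
models = {
    "A": (0.5, [0.0, 0.0], [1.0, 1.0]),
    "B": (0.5, [3.0, 3.0], [1.0, 1.0]),
}
print(classify([0.2, -0.1], models))
print(classify([2.8, 3.3], models))
```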
• Mixtures → observed features Y, segregation S, all related by P(X | Y, S)
[Figure: Observation Y(f), Source X(f), and Segregation S plotted against freq]
- spectral features allow clean relationship
• Joint classification of model and segregation:
\[ P(M, S \mid Y) = P(M) \int P(X \mid M) \cdot \frac{P(X \mid Y, S)}{P(X)} \, dX \cdot P(S \mid Y) \]
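A toy version of the joint search over model M and segregation S might look like the sketch below. It rests on several simplifying assumptions that are not stated on the slide: features are independent per frequency channel, P(X) is flat over an assumed range [LO, HI], a channel marked present has X = Y, and a channel marked masked only bounds the source by X ≤ Y. Under those assumptions the integral reduces to a density term for present channels and a bounded Gaussian mass for masked ones. All names and numbers are hypothetical.

```python
import math

LO, HI = 0.0, 10.0      # assumed support of the flat prior P(X) in each channel
RANGE = HI - LO

def log_gaussian(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gaussian_cdf(x, mean, var):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def log_score(y, mask, model, p_mask):
    """log of P(M) * integral term * P(S|Y), under the assumptions above."""
    prior, means, variances = model
    score = math.log(prior) + math.log(p_mask)
    for yf, present, m, v in zip(y, mask, means, variances):
        if present:
            # X = Y in this channel: P(y|M) / P(y), with P(y) = 1 / RANGE
            score += log_gaussian(yf, m, v) + math.log(RANGE)
        else:
            # X only known to lie in [LO, y] (requires y > LO): bounded marginal
            mass = gaussian_cdf(yf, m, v) - gaussian_cdf(LO, m, v)
            score += math.log(max(mass, 1e-300) * RANGE / (yf - LO))
    return score

def decode(y, models, masks):
    """Return the (model name, mask) pair with the highest joint score."""
    candidates = ((log_score(y, mask, model, p), name, mask)
                  for name, model in models.items()
                  for mask, p in masks.items())
    _, name, mask = max(candidates)
    return name, mask

# Hypothetical example: the target matches model "A"; channels 2-3 hold interference.
models = {
    "A": (0.5, [2.0, 2.0, 5.0, 5.0], [1.0, 1.0, 1.0, 1.0]),
    "B": (0.5, [6.0, 6.0, 5.0, 5.0], [1.0, 1.0, 1.0, 1.0]),
}
masks = {(1, 1, 1, 1): 0.5, (1, 1, 0, 0): 0.5}   # candidate segregations with P(S|Y)
y = [2.1, 1.9, 9.0, 9.2]
print(decode(y, models, masks))
```

In this toy case the best pair is the one that both matches the clean channels and treats the loud channels as masked, which is the point of deciding M and S jointly rather than in sequence.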
The Meeting Recorder Project
(CompSci, ICSI, UW, IDIAP, SRI, IBM)
• Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, IDIAP):
- 100 hours collected, ongoing transcription
• NSF ‘Mapping Meetings’ project
- also interest from NIST, DARPA, EU
Tabletop mics: Turn detection
(Huan Wei Hee, Jerry Liu)
• 4 mics ~ 1m separated along center of table
- 3 timing differences
- slight U/D offset to disambiguate
[Figure: plot with axes spanning -5 to 5 and -2.5 to 2.5; axis labels not recoverable]
• Hi-res cross-correlation for timings
- use normalized peak value for confidence
- cluster results
[Figure: two panels titled "mr-2000-11-02-1440: PZM xcorr lags" and "Example cross coupling response, chan3 to chan0"; axis labels: lag 1-2 / ms, lag 3-4 / ms, 100xR skew/samps, time / s]
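The timing estimate between a pair of microphones can be sketched as a search for the lag that maximizes the normalized cross-correlation, with the peak value serving as a confidence score. This is a minimal whole-sample version written with the standard library only. The sub-sample ("hi-res") refinement and the clustering step from the slide are omitted, and the sample rate, signal length, and 7-sample delay are assumptions for the demo.

```python
import math
import random

def normalized_xcorr_peak(a, b, max_lag):
    """Return (lag, peak): b is best explained as a delayed by `lag` samples.

    `peak` is the cross-correlation at that lag divided by the product of the
    signal norms, so it lies in [-1, 1] and can be used as a confidence value.
    """
    norm = math.sqrt(sum(x * x for x in a) * sum(x * x for x in b))
    best_lag, best_val = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for n in range(len(a)):
            m = n + lag
            if 0 <= m < len(b):
                acc += a[n] * b[m]
        val = acc / norm if norm > 0 else 0.0
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag, best_val

# Hypothetical test signals: noise at one mic, the same noise 7 samples later at another.
random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(400)]
delayed = [0.0] * 7 + source[:-7]

lag, peak = normalized_xcorr_peak(source, delayed, max_lag=20)
sample_rate = 16000                      # assumed sample rate in Hz
print(lag, 1000.0 * lag / sample_rate, peak)   # lag in samples, lag in ms, confidence
```

With four microphones, three such pairwise lags per frame give a point in lag space, and frames from the same talker should fall close together there.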
Visualization: transPlotter
(Jerry Liu)
• Speaker turn patterns are informative
• Browser for ‘high-level’ view, quick examination
- snack, iwidgets based
Music analysis: Structure recovery
(Rob Turetsky, Alex Sheh)
• Identify & match segment-level structure in music
- Foote similarity matrices point to repeated sections
- Chord segmentation using ASR ‘hidden state’ model (train with chord transcriptions)
- Note transcription/timing by aligning audio to resynthesized MIDI versions
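A similarity matrix of the kind mentioned in the first item can be sketched as pairwise cosine similarity between per-frame feature vectors. Repeated sections then appear as high-similarity entries away from the main diagonal. The three-dimensional feature frames and the A-B-A layout below are invented for illustration.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def similarity_matrix(frames):
    """S[i][j] is the cosine similarity between feature frames i and j."""
    return [[cosine_similarity(u, v) for v in frames] for u in frames]

# Hypothetical feature frames for a piece with section structure A B A.
section_a = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]]
section_b = [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
frames = section_a + section_b + section_a

S = similarity_matrix(frames)
for row in S:
    print(" ".join(f"{value:.2f}" for value in row))
```

Frames 0-1 and 4-5 come from the same section, so S[0][4] and S[1][5] are high while S[0][2] is low.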
Music Similarity & Recommendation
(Adam Berenzweig, Lawrence/NECI)
Sound texture modeling
(Marios Athineos)
• Sound textures are important but neglected
- fire, rain, paper: no clear pitch, onsets, shape
- typically modeled with filtered white noise
• LPC on spectrum captures temporal structure:
- better parameterization of texture structure?
- generate variations?
[Figure: plots illustrating the LPC-on-spectrum analysis; axis labels not recoverable]
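One way to read "LPC on spectrum" is as linear prediction applied to a cosine transform of the waveform rather than to the waveform itself. The resulting all-pole model then describes the temporal envelope instead of the spectral envelope. The sketch below follows that reading with an unnormalized DCT-II, the autocorrelation method, and a Levinson-Durbin recursion. The function names, frame length, and model order are assumptions, and this is not necessarily the exact formulation used in the work summarized here.

```python
import math

def dct2(x):
    """Unnormalized DCT-II, written out directly (fine for short frames)."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]

def autocorrelation(c, max_lag):
    return [sum(c[i] * c[i + lag] for i in range(len(c) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a1 z^-1 + ... ; returns (coefficients, residual energy)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def temporal_envelope(x, order):
    """All-pole estimate of the temporal envelope of frame x."""
    n = len(x)
    r = autocorrelation(dct2(x), order)
    if r[0] == 0.0:
        return [0.0] * n
    a, err = levinson_durbin(r, order)
    envelope = []
    for t in range(n):
        w = math.pi * (t + 0.5) / n          # time index t maps to this angle
        re = sum(a[k] * math.cos(w * k) for k in range(order + 1))
        im = -sum(a[k] * math.sin(w * k) for k in range(order + 1))
        envelope.append(err / (re * re + im * im))
    return envelope

# Hypothetical frame: a single click at sample 20 of a 64-sample frame.
click = [0.0] * 64
click[20] = 1.0
env = temporal_envelope(click, order=2)
peak_index = env.index(max(env))
print(peak_index)
```

The handful of prediction coefficients is a compact description of where energy sits in time within the frame, which is what makes it a candidate parameterization for textures without clear pitch or onsets.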
Audio Information Retrieval
(Manuel Reyes)
• Searching in a database of audio
- speech .. use ASR
- text annotations .. search them
- sound effects library?
• e.g. Muscle Fish “SoundFisher” browser
- define multiple ‘perceptual’ feature dimensions
- search by proximity in (weighted) feature space
[Figure: block diagram: Sound segment database → Segment feature analysis → Feature vectors → Search/comparison ← Segment feature analysis ← Query example]
- features are ‘global’ for each soundfile, no attempt to separate mixtures
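Search by proximity in a weighted feature space can be sketched as a nearest-neighbor ranking with a per-dimension weight. The file names, the three feature dimensions, and the weights below are hypothetical. This illustrates the general idea only, not the SoundFisher implementation.

```python
import math

def weighted_distance(u, v, weights):
    """Euclidean distance with a separate weight on each feature dimension."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))

def search(query, database, weights, top_k=3):
    """Return the names of the top_k database entries closest to the query."""
    ranked = sorted(database.items(),
                    key=lambda item: weighted_distance(query, item[1], weights))
    return [name for name, _ in ranked[:top_k]]

# Hypothetical database: one 'global' feature vector per sound file,
# e.g. (loudness, pitch, brightness) scaled to 0-1.
database = {
    "door_slam.wav": [0.9, 0.2, 0.1],
    "rain.wav": [0.3, 0.8, 0.7],
    "doorbell.wav": [0.6, 0.3, 0.9],
}
weights = [1.0, 1.0, 0.5]        # de-emphasize the third dimension
query = [0.85, 0.25, 0.2]        # features of the query example

results = search(query, database, weights, top_k=2)
print(results)
```

Because each file is summarized by a single vector, a recording that mixes two sources lands somewhere between them in feature space, which is the limitation noted in the last item above.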
Alarm sound detection
• Alarm sounds have particular structure
- people ‘know them when they hear them’
• Isolate alarms in sound mixtures
[Figure: three spectrogram panels; axes: freq / Hz (0-5000) vs. time / sec (1-2.5)]
- sinusoid peaks have invariant properties
[Figure: "Results": spectrogram of "Speech + alarm", freq / Hz vs. time / sec (0-9), with a Pr(alarm) track (0-1)]
- cepstral coefficients are easy to model
3 Future work: Automatic audio-video analysis
(Shih-Fu Chang, Kathy McKeown)
• Documentary archive management
- huge ratio of raw-to-finished material
- costly manual logging
• Problem: term ↔ signal mapping
- training corpus of past annotations
- interactive semi-automatic learning
[Figure: "MM Concept Network" block diagram; blocks: annotations, A/V data, text processing, A/V segmentation and feature extraction, terms, A/V features, concept discovery, A/V feature unsupervised clustering, Multimedia Fusion: Mining Concepts & Relationships, generic detectors, Complex SpatioTemporal Classification, Question Answering]
The ‘Machine listener’
• Goal: An auditory system for machines
- use same environmental information as people
• Signal understanding
- monitor for particular sounds
- real-time description
• Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots
LabROSA Summary
DOMAINS
• Meetings
• Personal recordings
• Location monitoring
• Broadcast
• Movies
• Lectures
ROSA
• Object-based structure discovery & learning
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Scene analysis
• Audio-visual integration
• Music analysis
APPLICATIONS
• Structuring
• Search
• Summarization
• Awareness
• Understanding