General Soundtrack Analysis
Dan Ellis
<dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/
Outline
1 LabROSA introduction
2 Broadcast soundtrack monitoring
3 Example technologies
4 Summary
Soundtrack @ FBIS - Dan Ellis
2002-03-12 - 1
1 LabROSA: Sound Organization
[Figure: spectrogram of alj0-20-50 (freq / Hz, 0-8000, vs. time / sec, 0-30), with an analysis into Voice1, Voice2, Music1, Music2, Music2 (quiet), and Stab]
• Analyzing and describing complex sounds:
- continuous sound mixture → distinct objects & events
• Human listeners as the prototype
- strong subjective impression when listening
- ...but hard to ‘see’ in signal
The information in sound
[Figure: two spectrogram panels, “Steps 1” and “Steps 2” (freq / Hz, 0-4000, vs. time / s, 0-4)]
• Hearing confers evolutionary advantage
- optimized to get ‘useful’ information from sound
• Enormous detail is available in familiar sounds
- ‘ecological’ influence on our sense of hearing
Audio Information Extraction
[Diagram: Audio Information Extraction (AIE) shown among related fields]
- Automatic Speech Recognition (ASR)
- Text Information Retrieval (IR)
- Spoken Document Retrieval (SDR)
- Text Information Extraction (IE)
- Blind Source Separation (BSS)
- Independent Component Analysis (ICA)
- Music Information Retrieval (Music IR)
- Computational Auditory Scene Analysis (CASA)
- Audio Fingerprinting
- Audio Content-based Retrieval (Audio CBR)
- Unsupervised Clustering
- Multimedia Information Retrieval (MMIR)
• Domain
- text ... speech ... music ... general audio
• Operation
- recognize ... index/retrieve ... organize
LabROSA Summary
DOMAINS
• Meetings
• Personal recordings
• Location monitoring
• HCI
• Broadcast
• Movies
• Lectures
• Music
ROSA
• Object-based structure discovery & learning
APPLICATIONS
• Speech recognition
• Speaker description
• Nonspeech recognition
• Scene Analysis
• Audio-visual integration
• Music analysis
• Multimedia access & search
• Personal media management
• Machine perception & awareness
• Prostheses / human augmentation
• Automatic judgments/recommendation
Outline
1 LabROSA introduction
2 Broadcast soundtrack monitoring
- Available information
- Information filtering
- Intelligent analysis
3 Example technologies
4 Summary
2 Broadcast soundtrack monitoring: What information is available?
• Video and soundtrack reinforce and complement each other
- hard things in one domain can be easy in the other
• Information at different scales:
[Figure: information types arranged by level (generic → specific) and timescale (seconds → days): sound events, words, dialog style, speaker emotion, speaker ID, activity level, program genre, story coverage, schedule changes]
Information filtering
• Maximizing analyst utility: information selection
• Automatic support for ‘triage’:
- segmentation
- inferring schedules
- identify repeated segments
- generic categorization/labeling
- task-specific flagging
Automatic labeling/flagging
• Computer as vigilant monitor
[Diagram: Signal + Models → Monitoring computer → Flagged events]
• Range of tasks
- generic ‘unusual’ patterns
- specific trigger events
• Issues
- false alarms vs. misses
- combinations of simple detectors for higher-level classes, e.g. program type
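The false-alarm vs. miss trade-off can be made concrete with a small sketch. It assumes a detector that emits one score per segment and some hand-labeled ground truth; the function name, scores, and labels below are all made up for illustration.

```python
def alarm_miss_rates(scores, labels, threshold):
    """Return (false_alarm_rate, miss_rate) for a detector at one threshold.

    scores: detector outputs, higher means "more likely an event".
    labels: True where an event is really present (hypothetical ground truth).
    """
    false_alarms = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    misses = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    negatives = sum(1 for y in labels if not y)
    positives = sum(1 for y in labels if y)
    fa_rate = false_alarms / negatives if negatives else 0.0
    miss_rate = misses / positives if positives else 0.0
    return fa_rate, miss_rate


# Made-up scores for eight segments, four of which contain the target event.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]
labels = [False, False, True, True, True, False, True, False]

for threshold in (0.3, 0.5, 0.75):
    fa, miss = alarm_miss_rates(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  false alarms={fa:.2f}  misses={miss:.2f}")
```

Raising the threshold trades false alarms for misses; where to sit on that curve depends on how costly each kind of error is to the analyst.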
Schedule summarization
• Learn the patterns of a particular broadcaster:
- 24/7 monitoring
- automatic segmentation into programs
- automatic classification of genres
[Figure: weekly schedule grid, days M T W Th F S Su vs. time of day 0000-2400, with genre legend: News, Talk, Sports, Music, Other]
• Support information extraction
- program types, repetitions, summarization
- schedule changes → activity?
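One way to picture schedule learning and change flagging, as a minimal sketch: it assumes segmentation and genre classification have already produced (weekday, hour, genre) observations. The function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def learn_schedule(history):
    """Map each (weekday, hour) slot to its most frequently observed genre."""
    counts = defaultdict(Counter)
    for weekday, hour, genre in history:
        counts[(weekday, hour)][genre] += 1
    return {slot: c.most_common(1)[0][0] for slot, c in counts.items()}


def schedule_changes(schedule, today):
    """Return (weekday, hour, usual_genre, observed_genre) for deviating slots."""
    return [(weekday, hour, schedule[(weekday, hour)], genre)
            for weekday, hour, genre in today
            if (weekday, hour) in schedule and schedule[(weekday, hour)] != genre]


# Made-up monitoring history for two Monday slots.
history = [("M", 6, "News"), ("M", 6, "News"), ("M", 6, "Talk"),
           ("M", 12, "Music"), ("M", 12, "Music"), ("M", 12, "Music")]
schedule = learn_schedule(history)

today = [("M", 6, "News"), ("M", 12, "News")]
print(schedule_changes(schedule, today))  # [('M', 12, 'Music', 'News')]
```

A flagged slot such as news replacing the usual midday music is the kind of schedule change that might indicate unusual activity.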
Outline
1 LabROSA introduction
2 Broadcast soundtrack monitoring
3 Example technologies
- Sound enhancement
- Locating repetitions
- Labeling soundtracks
4 Summary
3 Example technologies: Sound enhancement
• Time scale modification
- speed up for rapid skimming
- slow down for close listening
• Noise reduction
[Figure: four spectrogram panels (freq / Hz, 0-8000, vs. time / sec, 0-9): Original speech; Sped up by 50%; Slowed down to 50%; Noise reduced at pk - 20dB]
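Time scale modification can be illustrated with plain overlap-add (OLA): windowed frames are read from the input at one hop size and written to the output at another. This is a swapped-in textbook technique, not necessarily the method behind the examples above, and plain OLA gives audible artifacts on real speech. The function name, frame length, and hop size are assumptions.

```python
import math


def ola_time_stretch(x, rate, frame_len=512, synth_hop=128):
    """Change duration by roughly 1/rate using plain overlap-add.

    rate > 1 speeds the signal up; rate < 1 slows it down.
    """
    analysis_hop = max(1, round(synth_hop * rate))
    # Hann window applied to every frame before it is added to the output.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_frames = max(1, 1 + (len(x) - frame_len) // analysis_hop)
    out_len = (n_frames - 1) * synth_hop + frame_len
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for k in range(n_frames):
        a, s = k * analysis_hop, k * synth_hop
        for n, sample in enumerate(x[a:a + frame_len]):
            out[s + n] += sample * window[n]
            norm[s + n] += window[n]
    # Divide by the summed window so overlapping frames do not change the gain.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


sr = 8000
x = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]  # 1 s test tone
fast = ola_time_stretch(x, 1.5)  # "sped up by 50%"
slow = ola_time_stretch(x, 0.5)  # "slowed down to 50%"
print(len(x), len(fast), len(slow))
```

Because frames are moved rather than resampled, the pitch of each frame is kept while the overall duration changes.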
Repetition detection
• Matching signals is expensive
- but matching strings is fast
• Represent audio as simple string
- recognition into arbitrary classes
- match detection becomes string matching
[Diagram: soundtrack and target each pass through a Recognizer (class models: perc, spee, cell, anim, viol, mach), giving label strings such as “v p a v c a” and “p a c p v a s”; String match on the two strings gives detection]
• Monitor for many ‘signatures’
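Once audio is reduced to a label string, repetition detection is ordinary substring search. The sketch below assumes one class letter per segment, taken from the first letter of each model name (an assumption for illustration); the function name, soundtrack string, and signature names are made up. A mismatch allowance is included because recognizer output is rarely perfect.

```python
def find_signature(labels, signature, max_mismatches=0):
    """Return start indices where `signature` occurs in `labels`,
    allowing up to `max_mismatches` differing symbols."""
    hits = []
    for start in range(len(labels) - len(signature) + 1):
        window = labels[start:start + len(signature)]
        mismatches = sum(1 for a, b in zip(window, signature) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


# Hypothetical recognizer output: p = perc, a = anim, c = cell, v = viol, s = spee.
soundtrack = "pacpvaspacpvas"
signatures = {"jingle": "cpva", "sting": "vasp"}

for name, sig in signatures.items():
    print(name, find_signature(soundtrack, sig))
# jingle [2, 9]
# sting [4]
```

Monitoring for many signatures is then a loop over a dictionary of short strings, which is far cheaper than comparing waveforms.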
Soundtrack labeling
• Broad class: speech-music
[Figure: spectrogram of alj0-20-50 (freq / Hz, 0-8000) with a Pr(mus) track (0-1) vs. time / sec, 0-30]
• Specific events: alarms
[Figure: spectrogram of “Speech + alarm” (freq / Hz, 0-4000) with a Pr(alarm) track (0-1) vs. time / sec, 0-9]
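A probability track such as Pr(mus) or Pr(alarm) still has to be turned into labeled segments. A minimal sketch, assuming a fixed threshold and a minimum run length to suppress isolated frames; the function name, parameters, and sample values are hypothetical.

```python
def posterior_to_segments(probs, frame_sec, threshold=0.5, min_frames=3):
    """Convert a per-frame probability track into (start_sec, end_sec) segments."""
    segments = []
    start = None
    # A trailing 0.0 sentinel closes any run that reaches the end of the track.
    for i, p in enumerate(list(probs) + [0.0]):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            if i - start >= min_frames:
                segments.append((start * frame_sec, i * frame_sec))
            start = None
    return segments


# Made-up Pr(alarm) values, one per 0.5 s frame.
pr_alarm = [0.1, 0.2, 0.9, 0.8, 0.95, 0.7, 0.1, 0.6, 0.1, 0.1]
print(posterior_to_segments(pr_alarm, frame_sec=0.5))  # [(1.0, 3.0)]
```

The single high frame at index 7 is discarded because it is shorter than `min_frames`.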
Speaking style
• Recognizing information other than words
• Meeting Recorder project: Locate overlaps
[Figure: speaker activity for Spkr A-E plus tabletop level/dB (0-40) in mr-2000-06-30-1600, time 120-155 secs, annotated: speaker active, backchannel (signals desire to regain floor?), floor seizure, speaker B cedes floor, interruptions, breath noise, crosstalk]
[Figure: speaker activity in mr-2000-11-02 for participants C, A, O, M, J, L, time axis 0-60]
• Speaker emotion
- depends on good baseline
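Locating overlaps is straightforward once per-speaker activity is known. The sketch below assumes a binary active/inactive track per speaker on a common frame grid; the function name and data are made up, and obtaining reliable activity tracks is the hard part.

```python
def overlap_regions(activity, min_speakers=2):
    """Return (start_frame, end_frame) spans, end exclusive, in which at least
    `min_speakers` speakers are active at once.

    activity: {speaker: [0 or 1 per frame]}.
    """
    n_frames = min(len(track) for track in activity.values())
    regions, start = [], None
    for t in range(n_frames + 1):
        # The extra step at t == n_frames closes a region that runs to the end.
        active = sum(track[t] for track in activity.values()) if t < n_frames else 0
        if active >= min_speakers and start is None:
            start = t
        elif active < min_speakers and start is not None:
            regions.append((start, t))
            start = None
    return regions


activity = {
    "A": [1, 1, 1, 1, 0, 0, 0, 0],
    "B": [0, 0, 1, 1, 1, 1, 0, 0],
    "C": [0, 0, 0, 0, 0, 1, 1, 0],
}
print(overlap_regions(activity))  # [(2, 4), (5, 6)]
```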
4 Summary: Soundtrack analysis
• Soundtrack carries information
- useful and detailed
- complementary to image
- rapid processing
• Open source monitoring
- Need to find the interesting bits
- Short-time: specific detectors
- Long-time: schedule, genre classification
• Current techniques
- Signal enhancement
- Detectors for speech, music, alarms etc.
- Classification of interaction style, emotion
Extra slides
Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
[Diagram: sound → Feature calculation → feature vectors → Acoustic classifier (acoustic model parameters, trained from DATA) → phone probabilities → HMM decoder (word models, e.g. “s ah t”; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone / word sequence → Understanding/application...]
• ‘State of the art’ word-error rates (WERs):
- 2% (dictation) to 30% (telephone conversations)
• Can use multiple streams...
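For reference, word-error rate is conventionally the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch, assuming whitespace-tokenized text and a non-empty reference; the function name and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum edits needed to turn ref[:i] into hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(ref)


# One substitution (sat -> saw) and one deletion (on) against six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat saw the mat")
print(f"{wer:.3f}")  # 0.333
```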
The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, ...):
- 100 hours collected, ongoing transcription
- headsets + tabletop + ‘PDA’
Crosstalk cancellation
• Baseline speaker activity detection is hard:
[Figure: speaker activity for Spkr A-E plus tabletop level/dB (0-40) in mr-2000-06-30-1600, time 120-155 secs, annotated: speaker active, backchannel (signals desire to regain floor?), floor seizure, speaker B cedes floor, interruptions, breath noise, crosstalk]
• Noisy crosstalk model: m = C ⋅ s + n
• Estimate subband C_Aa from A’s peak energy
- ... including pure delay (10 ms frames)
- ... then linear inversion
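The linear inversion step can be sketched for two talkers and two microphones. This assumes the coupling matrix C is already known and ignores the noise term n and the delay; the helper names and all numbers are made up for illustration.

```python
def invert_2x2(c):
    """Invert a 2x2 matrix given as [[a, b], [d, e]]."""
    (a, b), (d, e) = c
    det = a * e - b * d
    if abs(det) < 1e-12:
        raise ValueError("coupling matrix is singular")
    return [[e / det, -b / det], [-d / det, a / det]]


def apply(matrix, vector):
    """Matrix-vector product for small list-based matrices."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


# Hypothetical subband coupling: each mic picks up its own wearer at gain 1.0
# and the other talker at a lower gain.
C = [[1.0, 0.2],
     [0.3, 1.0]]
s_true = [4.0, 0.0]           # only speaker A is talking
m = apply(C, s_true)          # mic B still shows energy from A (crosstalk)
s_hat = apply(invert_2x2(C), m)
print(m, s_hat)
```

After inversion, the energy on B's microphone is attributed back to A, so B is no longer mistaken for an active speaker.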
Alarm sound detection
• Alarm sounds have particular structure
- people ‘know them when they hear them’
• Isolate alarms in sound mixtures
[Figure: three spectrogram panels (freq / Hz, 0-5000, vs. time / sec, 1-2.5)]
- representation of energy in time-frequency
- formation of atomic elements
- grouping by common properties (onset &c.)
- classify by attributes...
• Key: recognize despite background
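The grouping step can be illustrated in isolation. The sketch below assumes earlier stages have already reduced the time-frequency energy to elements described by an onset time and a frequency; the function name, the element format, and the tolerance are all assumptions.

```python
def group_by_onset(elements, tolerance=0.05):
    """Group (onset_sec, freq_hz) elements whose onsets are close together.

    Elements are sorted by onset; an element joins the current group when its
    onset is within `tolerance` seconds of the previous element's onset.
    """
    groups = []
    for element in sorted(elements):
        if groups and element[0] - groups[-1][-1][0] <= tolerance:
            groups[-1].append(element)
        else:
            groups.append([element])
    return groups


# Made-up elements: three harmonically related tones starting together,
# plus one unrelated later element.
elements = [(1.00, 800), (1.02, 1600), (1.01, 2400), (1.80, 500)]
for group in group_by_onset(elements):
    print(group)
```

Each resulting group is a candidate object whose attributes can then be passed to a classifier.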
Audio Information Retrieval
(with Manuel Reyes)
• Searching in a database of audio
- speech .. use ASR
- text annotations .. search them
- sound effects library?
• e.g. Muscle Fish “SoundFisher” browser
- define multiple ‘perceptual’ feature dimensions
- search by proximity in (weighted) feature space
[Diagram: Sound segment database → Segment feature analysis → Feature vectors → Search/comparison → Results; Query example → Segment feature analysis → Search/comparison]
- features are ‘global’ for each soundfile, no attempt to separate mixtures
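Search by proximity in a weighted feature space amounts to nearest-neighbor ranking. A minimal sketch, assuming each sound file has one global feature vector already scaled to 0-1; the file names, feature values, and weights are invented, and this is not the SoundFisher implementation.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def search(database, query, weights, k=2):
    """Return the names of the k database entries nearest to the query."""
    ranked = sorted(database,
                    key=lambda name: weighted_distance(database[name], query, weights))
    return ranked[:k]


# Hypothetical global features per file: (brightness, bandwidth, pitch).
database = {
    "door_slam.wav": (0.2, 0.9, 0.1),
    "violin_a4.wav": (0.6, 0.2, 0.7),
    "flute_a4.wav": (0.5, 0.1, 0.7),
    "rain.wav": (0.8, 0.95, 0.0),
}
query = (0.58, 0.18, 0.7)  # features of the query example
print(search(database, query, weights=(1.0, 1.0, 2.0)))
# ['violin_a4.wav', 'flute_a4.wav']
```

The weights let a user say which perceptual dimensions matter most for a given query.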
Audio Retrieval: Results
• Musclefish corpus
- most commonly reported set
• Features
- mfcc, brightness, bandwidth, pitch ...
- no temporal sequence structure
• Results:
- 208 examples, 16 classes, 84% correct
- confusions:
[Table: confusion matrix over Musical instrs., Speech, Environ., Animals, Mechanical (column heads Instr, Spch, Env, Anim, Mech); the cell layout was lost in extraction. Surviving entries in extraction order: 136 (14), 17 (7), 2, 2, 1, 2, 6 (1), 2, 1 (0), 15 (2)]
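Two of the listed features can be sketched from a magnitude spectrum. The code uses one common definition, brightness as the spectral centroid and bandwidth as the magnitude-weighted spread around it; the exact definitions behind the reported results are not given here, so treat this as an assumption. The function name and the toy spectrum are made up.

```python
import math


def brightness_and_bandwidth(magnitudes, sample_rate):
    """Return (centroid_hz, spread_hz) for one magnitude spectrum.

    magnitudes: at least two bins, evenly spaced from 0 Hz to sample_rate / 2.
    """
    n = len(magnitudes)
    freqs = [k * (sample_rate / 2) / (n - 1) for k in range(n)]
    total = sum(magnitudes)
    if total == 0:
        return 0.0, 0.0
    centroid = sum(f * m for f, m in zip(freqs, magnitudes)) / total
    spread = math.sqrt(
        sum(m * (f - centroid) ** 2 for f, m in zip(freqs, magnitudes)) / total)
    return centroid, spread


# Toy spectrum with equal energy at 1000 Hz and 3000 Hz.
centroid, spread = brightness_and_bandwidth([0, 1, 0, 1, 0], sample_rate=8000)
print(centroid, spread)  # 2000.0 1000.0
```

Averaging such values over a whole file gives a ‘global’ descriptor with no temporal sequence structure.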
CASA for audio retrieval
• When audio material contains mixtures, global features are insufficient
• Retrieval based on element/object analysis:
[Diagram: Continuous audio archive → Generic element analysis → Element representations → Object formation → Objects + properties → Search/comparison → Results; Query example → Generic element analysis → (Object formation) → Search/comparison; Symbolic query (e.g. “rushing water”) → Word-to-class mapping → Properties alone → Search/comparison]
- features are calculated over grouped subsets