Semantic Audio Analysis Organizing Sound Mixtures Applications for Audio Semantics

advertisement
Semantic Audio Analysis
1 Semantic Audio Analysis
2 Organizing Sound Mixtures
3 Applications for Audio Semantics
4 Open Questions
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/
Dan Ellis
Semantic Audio Analysis (1 of 19)
2003-10-13
Semantic Audio Analysis
1
•
Audio Semantics
= what is the meaning / message?
- “AI complete”?
•
“Semantics” is broad!
- used for the-stuff-we-can’t-do-yet
•
How about speech recognition?
4000
frq/Hz
3000
0
2000
-20
1000
-40
0
0
2
4
6
8
10
12
time/s
-60
level / dB
IT'S NICE TO HAVE THIS FRIDAY WAS IN DOLLARS IN THE
BOMBING RAIDS WAS CLOSING THOUSAND TO FOUR
GUNMEN CAUSING CONDITIONING SAID THE STOCK
ROSE SMOKERS FROM THE TWENTY NINE NINETY FIVE
PLUS THREE ON THE EXPERT TECHNICIANS EXPECTED
TO REACH
- no good even if it worked!
Dan Ellis
Semantic Audio Analysis (2 of 19)
2003-10-13
Towards Semantic Audio Analysis
•
What do we want from SAA?
- describe sound in human-recognizable terms
- “automatic subtitles for real life”?
[thunder]
S: I THINK THE
WEATHER'S
CHANGING
•
Key step in speech recognition is classification
- convert continuous signal into discrete classes
- need a rich, application-dependent vocabulary
- hierarchy of pattern recognition
•
Listeners perceive sound sources
- If SAA primitives are to be subjective percepts...
- ...source segregation is the first problem?
Dan Ellis
Semantic Audio Analysis (3 of 19)
2003-10-13
SAA Applications
•
Subjective descriptions are the
ultimate sound representation
4000
frq/Hz
3000
0
2000
-20
1000
-40
0
0
-
•
Dan Ellis
2
4
6
8
10
12
time/s
-60
level / dB
data compression: “Radio ad for AARCO”
signal enhancement: “... without noise”
modification: “.. woman’s voice ..”
.. needs both analysis and synthesis?
Sound understanding useful for:
- indexing/retrieval
- robots
- prostheses?
Semantic Audio Analysis (4 of 19)
2003-10-13
Outline
1
Semantic Audio Analysis
2
Organizing Sound Mixtures
- Human Auditory Scene Analysis
- Organizing Mixtures by Computer
3
Applications for Audio Semantics
4
Open Questions
Dan Ellis
Semantic Audio Analysis (5 of 19)
2003-10-13
Auditory Scene Analysis
2
freq / Hz
(Bregman 1990)
•
How do people analyze sound mixtures?
- break mixture into small elements (in time-freq)
- elements are grouped in to sources using cues
- sources have aggregate attributes
•
Elements + attributes
8000
6000
4000
2000
0
0
1
2
3
4
5
6
7
8
9 time / s
- common onset ( = dependent origins)
- periodicity ( = single process)
- + spatial cues etc. + familiarity, context ...
Dan Ellis
Semantic Audio Analysis (6 of 19)
2003-10-13
Computational Auditory Scene Analysis:
The Representational Approach
(Cooke & Brown 1993)
•
input
mixture
Direct implementation of psych. theory
signal
features
Front end
(maps)
Object
formation
discrete
objects
Source
groups
Grouping
rules
freq
onset
time
period
frq.mod
- ‘bottom-up’ processing
- uses common onset & periodicity cues
•
frq/Hz
Able to extract voiced speech:
brn1h.aif
frq/Hz
3000
3000
2000
1500
2000
1500
1000
1000
600
600
400
300
400
300
200
150
200
150
100
brn1h.fi.aif
100
0.2
0.4
Dan Ellis
0.6
0.8
1.0
time/s
Semantic Audio Analysis (7 of 19)
0.2
0.4
0.6
0.8
2003-10-13
1.0
time/s
Approaches to handling sound mixtures
•
Separate signals, then recognize
- Computational Auditory Scene Analysis (CASA),
Independent Component Analysis
- nice, if you can make it work
•
Recognize combined signal
- ‘multicondition training’
- combinatorics seem daunting
•
Recognize with parallel models
- optimal inference from full joint state-space
p ( O, x , y ) Æ p ( x , y O )
- or: skip obscured fragments,
infer from higher-level context
- or do both: missing-data recognition
Dan Ellis
Semantic Audio Analysis (8 of 19)
2003-10-13
Multi-source decoding
(Barker, Cooke & Ellis 2003)
•
Missing Data recognizes from a subset;
Can search for more than one source
q2(t)
S2(t)
Y(t)
S1(t)
q1(t)
•
Mutually-dependent data masks
•
Use e.g. CASA features to propose masks
- locally coherent regions
•
Issues in models, representations, inference...
Dan Ellis
Semantic Audio Analysis (9 of 19)
2003-10-13
Outline
1
Semantic Audio Analysis
2
Organizing Sound Mixtures
3
Applications for Audio Semantics
- Meeting recordings
- Audio diary analysis
- Semantics of musical signals
4
Open Questions
Dan Ellis
Semantic Audio Analysis (10 of 19)
2003-10-13
The Meeting Recorder Project
3
(with ICSI, UW, IDIAP, SRI, Sheffield)
•
Microphones in conventional meetings
- for summarization / retrieval / behavior analysis
- informal, overlapped speech (Æ ASR...)
•
Behavioral: Look for patterns of speaker turns
Participant
mr04: Hand-marked speaker turns vs. time + auto/manual boundaries
10:
9:
8:
7:
5:
3:
2:
1:
0
5
Dan Ellis
10
15
20
25
30
35
40
Semantic Audio Analysis (11 of 19)
45
50
55
60
time/min
2003-10-13
freq / Hz
Personal Audio: The Listening Machine
•
Smart PDA records everything
you hear
•
Only useful if we have index,
summaries
- semantic descriptions (real time?)
•
Features appropriate for 1 minute segments...
4
2
freq / Bark
0
50
100
150
200
250 time / min
15
10
5
14:30
15:00
15:30
16:00
16:30
17:00
17:30
18:00
18:30
clock time
- Bark band variance, spectral entropy, ...
Dan Ellis
Semantic Audio Analysis (12 of 19)
2003-10-13
Music Information Retrieval
(Berenzweig & Ellis 2003)
•
Apply search concepts to music?
- “musical Google” – beat human annotation?
- application: finding new music
•
Construct music space where near = “similar”
Dan Ellis
Semantic Audio Analysis (13 of 19)
2003-10-13
Outline
1
Semantic Audio Analysis
2
Organizing Sound Mixtures
3
Applications for Audio Semantics
4
Open Questions
Dan Ellis
Semantic Audio Analysis (14 of 19)
2003-10-13
Open Questions
4
•
Semantics
- What are the abstract perceptual attributes of a
sound?
How can we describe them?
•
Mixtures
- How do people organize sound mixtures into
separate source percepts?
- How can we represent generic sound
knowledge?
•
Applications
- Search/retrieval: What terminology is most
natural for users querying sound databases?
- What is the best balance between machine and
human listening?
- What are the problems for which machine
listening can be most useful?
Dan Ellis
Semantic Audio Analysis (15 of 19)
2003-10-13
Extra slides
Dan Ellis
Semantic Audio Analysis (16 of 19)
2003-10-13
Missing Data Recognition
(Barker, Cooke & Ellis ’03)
•
Can evaluate speech models
p(x|m) over a subset of
dimensions xk
p ( xk m ) =
•
Ú p ( x k, x u
m ) dx u
xu
p(xk,xu)
y
p(xk|xu<y )
xk
p(xk )
Hence, missing data recognition:
Present data mask
P(x | q) =
dimension
P(x1 | q)
· P(x2 | q)
· P(x3 | q)
· P(x4 | q)
· P(x5 | q)
· P(x6 | q)
.. but need
segregation mask
time
•
Fit model and segregation given obs’n:
P( X Y , S )
P ( M , S Y ) = P ( M ) Ú P ( X M ) ◊ ------------------------- dX ◊ P ( S Y )
P( X )
Dan Ellis
Semantic Audio Analysis (17 of 19)
2003-10-13
What a speech HMM contains
•
Markov model structure: states + transitions
S
A
0.1
0.02
T
0.9
4
Model M'2
0.8
0.9
K
freq / kHz
0.8
0.18
0.2
3
0.8
0.05
2
A
0.15
20
T
0.2
30
E
40
O
0
•
10
0.8
0.1
1
O
0.1
0.9
K
S
E
0.1
freq / kHz
M'1
State Transition Probabilities
State models (means)
0.9
50
10
A generative model
- but not a good speech generator!
4
3
2
1
0
0
1
2
3
4
5
time / sec
- only meant for inference of p(X|M)
Dan Ellis
Semantic Audio Analysis (18 of 19)
2003-10-13
20
30
40
50
Music similarity from Anchor space
•
A classifier trained for one artist (or genre)
will respond partially to a similar artist
•
Each artist evokes a particular pattern of
responses over a set of classifiers
•
We can treat these classifier outputs as a new
feature space in which to estimate similarity
n-dimensional
vector in "Anchor
Space"
Anchor
Anchor
p(a1|x)
Audio
Input
(Class i)
AnchorAnchor
Anchor
Audio
Input
(Class j)
|x)
p(a2n-dimensional
vector in "Anchor
Space"
GMM
Modeling
Similarity
Computation
p(a1|x)p(an|x)
p(a2|x)
Anchor
Conversion to Anchorspace
GMM
Modeling
KL-d, EMD, etc.
p(an|x)
Conversion to Anchorspace
•
Dan Ellis
“Anchor space” reflects subjective qualities?
Semantic Audio Analysis (19 of 19)
2003-10-13
Download