Semantics from Sound:
Modeling Audio and Text
Thesis Proposal
May 31, 2006
Douglas Turnbull
Department of Computer Science & Engineering
UC San Diego
Committee: Charles Elkan, Gert Lanckriet, Serge Belongie, Sanjoy
Dasgupta, Shlomo Dubnov
Describing What We Hear
Sound carries rich information from which we derive
semantic understanding:
• Interpreter translating the speech of a foreign dignitary
• Analyst critiquing the tone of a political debate
• Critic writing a music review
• Movie director describing a sound effect
1
Computer Audition
We are interested in developing computer-based
systems that can listen to and describe sound.
Sound Classes:
– Speech
– Music
– Sound Effects
– Environmental Sounds
2
Computer Audition
We are interested in developing computer-based
systems that can listen to and describe sound.
Common Tasks:
– Speech-to-text recognition
– Speaker characterization: emotion, tone, accents
– Music information retrieval
– Monitoring using sound: sonar, bird migration
We will primarily focus on non-speech audio research.
3
Semantic Audio Annotation and Retrieval
We would like a system that can both
• annotate audio content with semantically
meaningful words
• retrieve relevant audio given a text-based query
We learn a probabilistic model using a heterogeneous
data set of audio and text.
• This can be framed as a supervised or unsupervised learning problem.
• Our initial approach uses a supervised multi-class naïve Bayes model.
4
Semantic Audio Annotation and Retrieval
This work involves 3 main components:
1. Heterogeneous data modeling
2. Audio data representation
   1. Extracting features from short-time segments
   2. Integrating to derive medium-time features
   3. Modeling long-time (track-level) aspects of audio
Audio representation will depend on the class of audio.
5
Semantic Audio Annotation and Retrieval
This work involves 3 main components:
1. Heterogeneous data modeling
2. Audio data representation
3. Text processing
   1. Identifying the semantics of sound - vocabulary selection
   2. Processing of music review documents or sound effect captions
6
Outline
• Related work
• “Modeling Music with Words”
• Future research directions
• Thesis logistics
7
Semantic-Audio Retrieval (SAR)
Slaney’s SAR system is the only existing audio annotation and
retrieval system [Sla02a, Sla02b, Buc05].
1. Learn separate hierarchical models in each space
   – Semantic space
      1. cluster documents
      2. represent each cluster with a multinomial distribution [BF01]
   – Audio space
      1. learn a GMM for each ‘anchor’ audio file
      2. create a distance matrix: -[L_A(B) + L_B(A)]/2 for audio files A and B, where L_A(B) is the likelihood of B under A’s model (see the sketch below)
      3. perform agglomerative clustering based on distance to all anchor points
2. Create separate linkages between spaces
   – Annotation (audio-to-text linkage)
   – Retrieval (text-to-audio linkage)
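A minimal sketch of this symmetric anchor distance, assuming (not taken from [Sla02a]) that each anchor model is a scikit-learn GaussianMixture and each audio file is a matrix of frame-level feature vectors; function names are illustrative:

```python
from sklearn.mixture import GaussianMixture

def train_anchor_gmm(frames, n_components=8):
    """Fit one anchor GMM to the frame-level feature vectors of one audio file."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(frames)                      # frames: (num_frames, dim) array
    return gmm

def anchor_distance(gmm_a, frames_a, gmm_b, frames_b):
    """Symmetric distance -[L_A(B) + L_B(A)] / 2 between two anchor files."""
    l_a_of_b = gmm_a.score(frames_b)     # average log-likelihood of B under A's model
    l_b_of_a = gmm_b.score(frames_a)     # average log-likelihood of A under B's model
    return -(l_a_of_b + l_b_of_a) / 2.0
```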
9
Semantic-Audio Retrieval (SAR)
Annotation
1. Evaluate the query song under each node (GMM) in the acoustic space
2. Identify the documents associated with the highest-likelihood node
3. Learn a multinomial distribution from those documents
4. Generate an annotation
10
Semantic-Audio Retrieval (SAR)
Retrieval
1. Learn a GMM for audio associated with each node of the semantic
hierarchy
2. Evaluate the query text under each node in semantic space
3. Estimate a GMM from audio files associated with the highest
likelihood node
4. Retrieve audio files that have high likelihood under this GMM
11
Semantic-Audio Retrieval (SAR)
Comments:
1. Subsequent work includes
   – Mixture of Experts formulation [Sla02b]
   – Alternative text and audio representations [Buc05]
2. The hierarchy of semantic concepts may be restrictive
   – e.g. what should the top-level concept be: instrumentation or genre?
3. Inference is computationally expensive
   – Annotation: evaluation of the query song under each anchor model
   – Retrieval: evaluation of the training set under the learned query model
4. Few quantitative results are shown
12
Automatic Record Review
Whitman and Ellis model heterogeneous data of
audio and text to generate ‘unbiased’ record
reviews [WE04].
– A classifier (SVM) is learned for each word in their vocabulary.
– ‘Grounded’ words produce classifiers that can separate the musical
audio content.
– Sentences from an existing review with many ‘grounded’ terms are
retained, while sentences with ‘biased’ terms are deleted.
Comments:
• Creative solution for vocabulary selection
• Retrieval is discussed but never implemented
13
Related Research Areas
• Semantic Multimedia Retrieval
– Image/Video annotation and retrieval models
– System evaluation
• Music Information Retrieval
– Query-by-example - acoustic similarity
– Music classification - genre, instrument, emotion
• Digital Signal Processing
– Audio feature extraction
• e.g. Mel-frequency cepstral coefficients (MFCC)
– Audio representation
• e.g. Gaussian mixture models (GMM), hidden Markov models (HMM)
14
Related Research Areas
• Computer Vision
   – Feature design and selection
      • Boosting
      • Interest point detection
      • Automatic segmentation
   – Joint semantic audio and visual models for video data
• Multiple Instance Learning
   – Modeling when each data point is represented as a bag of features
   – Some semantic concepts apply to just a few of the features
• Multiple View Learning
   – Incorporating additional sources of information
   – e.g. in the music domain:
      • Multiple song or album reviews
      • Musical playlists and sales records
      • Song lyrics
15
Outline
• Related work
• “Modeling Music with Words”
• Future research directions
• Thesis logistics
16
Modeling Music with Words
• Joint work with Luke Barrington and Gert Lanckriet
• We developed the first automatic music annotation and
retrieval system
• Supervised multi-class naïve Bayes approach
– Each word in our vocabulary is a ‘class’
– Developed by Carneiro and Vasconcelos [CV05] for image annotation as
a reaction against supervised one-versus-all and unsupervised models
[BDF+02, BJ03, FLM04]
+ Scalable in database size
+ Fewer demands on quality of labeling (e.g. weakly labeled data)
+ Produces a natural ranking of semantic concepts
+ Explicitly models semantics of the problem
17
System Overview
[System diagram: training data (audio track and review text pairs) passes through audio-feature and text-feature extraction into parameter estimation of the parametric model; a novel song is evaluated under the model to produce an annotation (review), and a text query is answered by inference over the model (retrieval).]
18
System Overview
Vocabulary: M semantic tokens (referred to as ‘words’)
   – Each semantic token is a ‘musically informative’ unigram or bigram
   – e.g. ‘rock’, ‘romantic’, ‘bob;dylan’, ‘electric;guitar’
Text representation: each document is represented as a binary document vector y = (y_1,…,y_M), where y_i is 1 if word i is present, and 0 otherwise.
Audio representation: each audio track X = {x_1,…,x_T} is represented by a bag of T real-valued feature vectors, where T depends on track length.
   – Music vectors are extracted every 3/4 second [MB03]
      1. dynamic Mel-frequency cepstral coefficients (DMFCC)
      2. auditory filterbank temporal envelopes (AFTE)
   – Sound effect vectors are extracted every 10 msec [Buc05]
      1. delta cepstrum features based on MFCCs and their derivatives
Heterogeneous data set: track-document pairs {(X_1, y_1), … , (X_D, y_D)} (see the sketch below)
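As a concrete illustration of these representations, here is a minimal sketch that assembles one track-document pair. It assumes generic MFCCs extracted with librosa as a stand-in for the DMFCC/AFTE features actually used, and a toy vocabulary; all names are illustrative:

```python
import numpy as np
import librosa

vocabulary = ["rock", "romantic", "bob;dylan", "electric;guitar"]  # toy M = 4

def audio_to_bag_of_vectors(path, hop_seconds=0.75):
    """Return X = {x_1, ..., x_T}: one feature vector per ~3/4-second step."""
    y, sr = librosa.load(path, sr=22050)
    hop = int(hop_seconds * sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    return mfcc.T                          # shape (T, 13); T depends on track length

def review_to_document_vector(review_tokens):
    """Return y = (y_1, ..., y_M): y_i = 1 if token i appears in the review."""
    present = set(review_tokens)
    return np.array([1 if w in present else 0 for w in vocabulary])
```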
19
Multi-class naïve Bayes model
We learn a class-conditional distribution P(x|i) for each word i in our vocabulary.
• Each ‘word-level’ distribution is modeled with a Gaussian mixture model (GMM).
• The training data for word i is the set of tracks that have word i in the associated text document.
• The parameters of the GMM are learned using the expectation-maximization (EM) algorithm.
   – Direct estimation (see the sketch below)
   – Naïve averaging estimation using ‘track-level’ models
   – ‘Hierarchy of mixtures’ estimation using ‘track-level’ models [Vas01]
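A minimal sketch of ‘direct estimation’, assuming scikit-learn’s GaussianMixture (whose fit() runs EM), per-track feature matrices, and binary document vectors; names are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_word_models(tracks, docs, n_words, n_components=16):
    """tracks[d]: (T_d, dim) feature matrix of track d; docs[d]: binary document vector."""
    word_models = {}
    for i in range(n_words):
        # Pool the frames of every track whose review contains word i.
        frames = [tracks[d] for d in range(len(tracks)) if docs[d][i] == 1]
        if not frames:
            continue                        # word never observed in training
        X = np.vstack(frames)
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(X)                          # EM estimation of P(x | i)
        word_models[i] = gmm
    return word_models
```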
20
Annotation
Given our densities P(x|i) and a query track (x_1,…,x_T), we select the word

   i* = argmax_i P(i | x_1,…,x_T) = argmax_i P(x_1,…,x_T | i) P(i)

If we assume the feature vectors x_t are conditionally independent given the word, then

   P(x_1,…,x_T | i) = ∏_{t=1}^{T} P(x_t | i)

Assuming a uniform word prior and taking a log transform, we have

   i* = argmax_i ∑_{t=1}^{T} log P(x_t | i)

We compute the average log-likelihood (dividing by T), since longer songs would otherwise have disproportionately low likelihoods [RQD00].
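A minimal sketch of this annotation rule, assuming the word-level GMMs from the previous sketch; GaussianMixture.score() already returns the average log-likelihood over the track’s T vectors:

```python
def annotate(track_frames, word_models, n_annotation_words=10):
    """Rank words by (1/T) * sum_t log P(x_t | i) and keep the top few."""
    scores = {i: gmm.score(track_frames)    # score() averages over the T vectors
              for i, gmm in word_models.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_annotation_words]
```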
21
Retrieval
We would like to rank test songs by the likelihood P(x_1,…,x_T | q) of each song given a query word q.
However, in practice, this results in almost the same ranking for all query words.
There are two reasons:
1. Length Bias
   • Longer songs have proportionately lower likelihood, resulting from the sum of additional log terms.
   • This results from the poor conditional independence assumption between audio feature vectors [RQD00].
   • Solution: we compute the average likelihood for each song (dividing by T).
22
Retrieval
We would like to rank test songs by the likelihood P(x_1,…,x_T | q) of each song given a query word q.
However, in practice, this results in almost the same ranking for all query words.
There are two reasons:
1. Length Bias
2. Song Bias
   1. Many conditional word distributions P(x|q) are similar to the generic song distribution P(x)
   2. Songs with high probability under P(x) (e.g. generic songs) will often have high probability under P(x|q)
23
Retrieval
Instead, we normalize by the song bias P(x_1,…,x_T) and rank songs by the posterior P(q | x_1,…,x_T):

   P(q | x_1,…,x_T) = P(x_1,…,x_T | q) P(q) / P(x_1,…,x_T)

Normalizing by P(x) allows each song to place emphasis (i.e. weight) on the words that increase the probability of x.
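A minimal sketch of this normalized ranking, assuming the word models above, a ‘generic’ GMM fit on all training frames as a stand-in for P(x), and a uniform word prior so the prior term can be dropped:

```python
def retrieve(query_word, test_tracks, word_models, generic_model):
    """Rank test tracks by log P(q | x_1..x_T), up to a constant, for one query word q."""
    gmm_q = word_models[query_word]
    scored = []
    for idx, frames in enumerate(test_tracks):
        # average log P(x|q) minus average log P(x): removes the song bias
        scored.append((gmm_q.score(frames) - generic_model.score(frames), idx))
    return [idx for _, idx in sorted(scored, reverse=True)]
```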
24
Experimental Setup
Data: 2131 song-review pairs
   – Audio: mainstream western music from the last 60 years
      • DMFCC and AFTE features
   – Text: expert song reviews from the AMG Allmusic database
      • Vocabulary of 317 unigrams and bigrams
Model: supervised multi-class naïve Bayes
   – Direct estimation and naïve averaging for word densities
Tasks:
   Annotation: annotate each test song with 10 words
   Retrieval: rank order all test songs given a query word
25
Qualitative Annotation Results
26
Qualitative Retrieval Results
27
Evaluation Metrics
Annotation: mean per-word recall and precision
For each word w, let
   |wH| = # of human annotations with word w
   |wA| = # of automatic annotations with word w
   |wC| = # of correct automatic annotations
Per-word recall = |wC| / |wH|
Per-word precision = |wC| / |wA| (see the sketch below)
Mean recall and precision: average over all words in the vocabulary
Precision ranges between 0.0 (bad) and 1.0 (good).
Recall ranges between 0.0 (bad) and 0.84 (best possible) since we annotate with far fewer words than are present in the test set corpus:
   • The test set contains 8232 words
   • Our system outputs 4250 words
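A minimal sketch of these annotation metrics, assuming the human and automatic annotations are given as per-track word sets; words with no human (or no automatic) annotations are simply skipped when averaging, which is an assumption rather than the exact protocol used:

```python
def per_word_metrics(human_annotations, auto_annotations, vocabulary):
    """Both inputs map each test track to a set of words (same track keys)."""
    recalls, precisions = [], []
    for w in vocabulary:
        wH = sum(1 for t in human_annotations if w in human_annotations[t])
        wA = sum(1 for t in auto_annotations if w in auto_annotations[t])
        wC = sum(1 for t in auto_annotations
                 if w in auto_annotations[t] and w in human_annotations[t])
        if wH:
            recalls.append(wC / wH)
        if wA:
            precisions.append(wC / wA)
    mean_recall = sum(recalls) / len(recalls) if recalls else 0.0
    mean_precision = sum(precisions) / len(precisions) if precisions else 0.0
    return mean_recall, mean_precision
```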
28
Evaluation Metrics
Retrieval: mean AP and AROC
1. Average precision (AP):
   • iterate through the ranking
   • average the precision at each point where we correctly identify a new song
   • values between 0.0 (bad) and 1.0 (perfect)
2. Area under the ROC curve (AROC):
   • ROC curve - true positive rate vs. false positive rate
   • AROC - integrate the ROC curve as we iterate through the ranking
   • values between 0.0 (bad) and 1.0 (perfect)
   • a random ranking gives AROC = 0.5
Mean AP and AROC are averages over all words in the vocabulary (see the sketch below).
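A minimal sketch of these retrieval metrics for a single query word, assuming scikit-learn’s implementations and a binary relevance vector over the test songs:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

def retrieval_metrics(relevance, scores):
    """relevance: 0/1 per test song; scores: the model's ranking score per song."""
    ap = average_precision_score(relevance, scores)
    aroc = roc_auc_score(relevance, scores)   # ~0.5 expected for random scores
    return ap, aroc
```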
29
Quantitative System Evaluation
Tasks:
Annotation: annotate each test song with 10 words
Retrieval: rank order all test songs given a query word
Evaluation metrics:
Annotation: mean per-word recall and precision
Retrieval: mean AP and AROC
Baseline models:
1. Random words - uniform distribution over words
2. Prior (stochastic) - pick words from a multinomial distribution
parameterized by prior probability P(i) of word i
3. Prior (deterministic) - rank words according to P(i)
30
Quantitative System Evaluation
Annotation: the system produces significantly better results than random (3x recall, 2x precision).
   – DMFCC features produce superior results
   – Direct and naïve-averaging estimation produce comparable results
   – The best recall (0.09) and precision (0.12) leave room for improvement
31
Quantitative System Evaluation
Retrieval: the system produces significantly better results than random.
   – DMFCC features produce superior results
   – Direct and naïve-averaging estimation produce comparable results
   – The best AP (0.11) and AROC (0.61) leave room for improvement
32
Comments on results
Our results leave much to be desired, but:
– Ground truth is noisy
   • Authors do not make an explicit list of words when reviewing a song
   • Companies (Microsoft, Moodlogic) have collected clean music data
– State-of-the-art image annotation and vision systems have precision and recall of about 0.25.
   • Directly comparing image and audio systems is dangerous since the relative objectivity of the tasks and the quality of the data affect performance.
– Early sound effects results show promise:
   • Recall = 0.16 (random 0.01)
   • Precision = 0.10 (random 0.02)
   • Mean AP = 0.15 (random 0.05)
   • Area under ROC = 0.75 (random 0.50)
33
Outline
• Related work
• “Modeling Music with Words”
• Future research directions
• Thesis logistics
34
1: Implementing unsupervised models
Recent unsupervised models have been proposed for modeling semantics [BJ03, FLM04, LW03, BDF+02]
   – Introduce a ‘latent’ variable that encodes a set of states
   – Each state represents a joint distribution between content-based features and semantic concepts
Two recent models are:
1. Correspondence Latent Dirichlet Allocation [BJ03]
   • Latent states: hidden topics that are learned during training
   • Parameter estimation involves variational approximation or MCMC sampling
2. Multiple Bernoulli Relevance Model [FLM04]
   • Latent states: the training images themselves
   • Parameter estimation reduces to counting and clever smoothing
35
2: Audio Representation
Modeling longer-term temporal aspects of audio
   – The GMM formulation represents audio as a bag of vectors
   – HMMs and conditional random fields (CRFs) can be used to model trajectories of features over time
   – Feature integration using signal processing techniques (e.g. modulation spectra [SAP04])
Incorporating automatic audio segmentation [Got03]
   – Represent homogeneous segments of audio content
Adapting feature selection algorithms used in vision research
   1. Boosting [VJ02, OFPA06]
   2. Interest point detection [MS02, AR05]
36
Other Directions
3: Text Processing
Incorporate part-of-speech tagging, model synonym
relationships, and learn dependencies between
semantic tokens [TH02].
4: Multiple Instance Learning [DLLP97, RC05]
Using average log likelihood for annotation is
problematic if some semantic concepts are related
to just a few feature vectors.
We could use alternatives, such as minimum or
maximum log likelihood.
37
Other Directions
5: Multiple View Learning [RS05]
Each of our heterogeneous data points may have many sources of
information:
• Alternative song or album reviews
• Lyrics
• Playlist or sales records information
Multiple view learning involves combining these additional sources of
information.
6: Modeling with Clean Annotations
Our noisy dataset degrades our performance. Working with companies that
have collected cleaner data sets can lead to mutually beneficial
collaboration.
– Microsoft: paid experts produce high-quality manual annotations
– Moodlogic: a large user base fills out standard annotation forms
38
Outline
• Related work
• “Modeling Music with Words”
• Future research directions
• Thesis logistics
39
Prioritized Work Schedule
1. Unsupervised Models: CorrLDA [BJ03]
2. Audio Representation
   1. HMMs [FPW05]
   2. Feature selection using boosting [VJ02]
   3. Automatic segmentation [Got03]
   4. Alternative extraction and integration techniques
3. Multiple instance learning
4. Using clean annotations
5. Language modeling
Note: Bold represents current or ‘near future’ research.
40
Thesis Chapters
1. Introduction to modeling the semantics of audio data
2. Related work
   • Semantic modeling of multimedia data
   • Audio feature design
   • Text processing
3. Audio and text representation
4. Modeling using supervised multi-class approach
5. Second technical contribution
   1. Unsupervised approach, or
   2. Multiple instance learning, or
   3. Others (MVL, similarity with semantics)…
6. Conclusions, discussions, and future work
Note: Bold represents chapters with novel technical material.
41
Schedule
Spring 06
– ISMIR paper “Modeling Music with Words”
– Thesis proposal
– NIPS paper “Modeling the Semantics of Sound”
Summer 06
– NSF EAPSI research fellowship with Dr. Masataka Goto at the Japanese
National Institute of Advanced Industrial Science and Technology (AIST)
Fall, Winter, Spring 06-07
– Research at UCSD
– Potential papers at ICASSP, ICML, NIPS, ISMIR, SIGIR
Summer 07
– Research outside of UCSD - Columbia (Dan Ellis), U. of Victoria (George
Tzanetakis), Microsoft (John Platt), OFAI (Gerhard Widmer), IRCAM,
Moodlogic, etc.
Fall, Winter 07
– Prepare and defend thesis
– Apply for post doctoral, research lab, and/or teaching positions
Spring 08
– Submit thesis, continue job search, finish outstanding papers
42
The beginning…
Image from www.therecordcollector.org
43
References
44
References
45
Correspondence Latent Dirichlet Allocation
Corr-LDA is a popular latent-variable model developed for image annotation [BJ03]
Each image is an (r, w) pair:
• r is a vector of N image-region feature vectors
• w is a vector of M keywords from an image caption vocabulary of W keywords
46
Correspondence Latent Dirichlet Allocation
The generative process for an image and its annotation under the Corr-LDA model is:
1. Draw θ from a Dirichlet distribution with parameter α
   • θ is a multinomial random variable and can be thought of as a “distribution over topics.”
2. For each of the N image regions r_n, draw a topic z_n, then draw an image region from the Gaussian distribution associated with topic z_n.
3. For each of the M keywords w_m, pick one of the topics that was chosen in step 2, then draw a keyword from the multinomial distribution associated with this topic.
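A minimal sketch of this generative story with toy parameters (the parameter names and shapes are illustrative; this shows only the sampling process, not the variational estimation from [BJ03]):

```python
import numpy as np

def generate_image_and_caption(alpha, region_means, region_covs,
                               topic_word_probs, N=5, M=3, rng=None):
    """alpha: Dirichlet parameter over K topics; region_means/covs: per-topic Gaussians;
    topic_word_probs: per-topic multinomials over the keyword vocabulary."""
    if rng is None:
        rng = np.random.default_rng()
    K = len(region_means)                        # number of latent topics
    theta = rng.dirichlet(alpha)                 # 1. topic proportions ~ Dir(alpha)
    z = rng.choice(K, size=N, p=theta)           # 2a. a topic z_n for each region
    regions = np.array([rng.multivariate_normal(region_means[k], region_covs[k])
                        for k in z])             # 2b. r_n ~ N(mu_{z_n}, Sigma_{z_n})
    y = rng.choice(N, size=M)                    # 3a. each keyword picks one region's topic
    words = np.array([rng.choice(len(topic_word_probs[z[n]]),
                                 p=topic_word_probs[z[n]])
                      for n in y])               # 3b. w_m ~ Mult(beta_{z_{y_m}})
    return regions, words
```

The index y_m ties each caption word to one of the image’s regions, which is the “correspondence” that distinguishes Corr-LDA from plain LDA.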
47