Collective Annotation of Music from Multiple Semantic Categories

Zhiyao Duan (1,2), Lie Lu (1), and Changshui Zhang (2)

1. Microsoft Research Asia (MSRA), Beijing, China.
2. Department of Automation, Tsinghua University, Beijing, China.
Summary
• Two collective semantic annotation methods for music, modeling not only individual labels but also label correlations.
• 50 musically relevant labels are manually selected for music annotation, covering 10 aspects of music perception.
• Normalized mutual information is employed to measure the correlation between two semantic labels.
• Label pairs with strong correlation are selected and modeled:
  – Generative: Gaussian Mixture Model (GMM)-based method
  – Discriminative: Conditional Random Field (CRF)-based method
• Experimental results show slight but consistent improvements compared with individual annotation methods.
Motivation
• Semantic annotation of music is an important research direction.
• Semantic labels (text, words) are a more compact and efficient representation than raw audio or low-level features.
• They potentially facilitate applications, e.g., music retrieval and recommendation.
• Disadvantages of previous methods:
  – Vocabularies without structured labels -> annotations that do not cover sufficient musical aspects.
  – They model audio-label relations only, without label-label relations.
    • E.g., "hard rock" & "electric guitar", "happy" & "minor key"
• Therefore, we divide the semantic vocabulary into categories, and attempt to model label correlations.
Semantic Vocabulary
1. Consists of 50 labels, manually selected from web-parsed musically relevant words (Table 1).
2. 10 semantic categories (aspects).
3. A limit on the number of labels from each category used in an annotation.
4. Normalized Mutual Information (NormMI) is used to measure the correlation of each label pair:

   $\mathrm{NormMI}(X;Y) = \frac{I(X;Y)}{\sqrt{H(X)\,H(Y)}}$   (1)

   where $H(X)$ is the entropy of $X$ and $I(X;Y)$ is the mutual information between $X$ and $Y$.
   Properties:
   • $0 \le \mathrm{NormMI}(X;Y) \le 1$;
   • $\mathrm{NormMI}(X;Y) = 0$ when $X$ and $Y$ are statistically independent;
   • $\mathrm{NormMI}(X;X) = 1$.
5. Only the label pairs whose NormMI values are larger than a threshold are selected to be modeled (the selected pairs are listed in Table 2; a computational sketch follows this list).
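To make this concrete, here is a minimal Python sketch of NormMI over binary label vectors and of the threshold-based pair selection. The function and variable names are ours, not the paper's, and the entropies are empirical estimates from one annotation per song.

```python
import numpy as np

def norm_mi(x, y):
    """NormMI(X;Y) = I(X;Y) / sqrt(H(X) H(Y)) for two binary label
    vectors (one entry per song; 1 = present, -1 = absent)."""
    x, y = np.asarray(x), np.asarray(y)

    def entropy(counts):
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    h_x = entropy(np.unique(x, return_counts=True)[1])
    h_y = entropy(np.unique(y, return_counts=True)[1])
    # Joint entropy from the empirical joint distribution of (x, y).
    h_xy = entropy(np.unique(np.stack([x, y], axis=1), axis=0,
                             return_counts=True)[1])
    if h_x == 0 or h_y == 0:        # a constant label carries no information
        return 0.0
    mi = h_x + h_y - h_xy           # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return mi / np.sqrt(h_x * h_y)

def select_pairs(labels, threshold=0.1):
    """Return label pairs whose NormMI exceeds the threshold
    (0.1 is the value used in the experiments)."""
    n = labels.shape[1]             # labels: songs x labels matrix
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if norm_mi(labels[:, i], labels[:, j]) > threshold]
```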
Audio Feature Extraction
A bag of beat-level feature vectors is used to represent a song (a pipeline sketch follows the list):
1. Each song is divided into beat segments.
2. Each segment contains a number of frames of 20 ms length with 10 ms overlap.
3. Timbre features (94-d) and rhythm features (8-d) are extracted to compose a 102-d feature vector for each segment.
4. PCA reduces the dimensionality to 65, preserving 95% of the energy.
• Timbre features: means and standard deviations of 8-order MFCCs, spectral shape features, and spectral contrast features.
• Rhythm features: average tempo, average onset frequency, rhythm regularity, rhythm contrast, rhythm strength, and average drum frequency, amplitude, and confidence [1].
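The sketch below reproduces the shape of this pipeline using open-source stand-ins (librosa for beat tracking, MFCCs, and spectral contrast; scikit-learn for PCA). The original work used its own extractors, and the spectral shape and rhythm features of [1] are omitted here, so the dimensionality differs from the paper's 102-d.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA

def beat_level_features(path, sr=22050):
    """Bag of beat-level timbre vectors for one song (illustrative only)."""
    y, _ = librosa.load(path, sr=sr)
    _, beats = librosa.beat.beat_track(y=y, sr=sr)   # 1. beat segmentation
    bounds = librosa.frames_to_samples(beats)
    n_fft, hop = 512, 256                            # 2. ~20 ms frames, ~10 ms hop
    vectors = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        seg = y[start:end]
        if len(seg) <= n_fft:
            continue
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=8, n_mels=40,
                                    n_fft=n_fft, hop_length=hop)
        contrast = librosa.feature.spectral_contrast(y=seg, sr=sr,
                                                     n_fft=n_fft, hop_length=hop)
        frames = np.vstack([mfcc, contrast])         # 3. per-frame timbre features
        # Per-segment vector: means and standard deviations over the frames.
        vectors.append(np.concatenate([frames.mean(axis=1), frames.std(axis=1)]))
    return np.array(vectors)

# 4. PCA preserving 95% of the energy (102-d -> 65-d in the paper),
#    fitted on the pooled segments of all training songs:
#    pca = PCA(n_components=0.95).fit(np.vstack(all_training_bags))
```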
Semantic Annotation
Problem: find semantic words to describe a song. This can be viewed as a multi-label binary classification problem.
Input: a vocabulary of $N$ labels (words), $\mathcal{V} = \{w_1, \dots, w_N\}$; a bag of feature vectors of a song, $\mathcal{X} = \{x_1, \dots, x_T\}$.
Output: an annotation vector $\mathbf{a} = (a_1, \dots, a_N)$, where $a_i$ is a binary variable for $w_i$ (1: presence, -1: absence).
Solution: Maximum A Posteriori (MAP): $\mathbf{a}^* = \arg\max_{\mathbf{a}} P(\mathbf{a} \mid \mathcal{X})$.

Previous methods: labels are treated independently.
• Individual GMM-based method:

  $a_i^* = \arg\max_{a_i} \underbrace{P(a_i \mid \mathcal{X})}_{\text{single label posterior}}, \quad P(a_i \mid \mathcal{X}) \propto P(\mathcal{X} \mid a_i)\, P(a_i)$   (2)

  The likelihood $P(\mathcal{X} \mid a_i)$ can be estimated using a GMM from training data; the prior probability $P(a_i)$ can be set to a uniform distribution.

Proposed methods: consider the relations between labels.
1) Collective GMM-based method: approximates the posterior by combining single-label and label-pair terms,

  $\mathbf{a}^* = \arg\max_{\mathbf{a}} \Big[ \sum_{i=1}^{N} \log \underbrace{P(a_i \mid \mathcal{X})}_{\text{single label posterior}} + \lambda \sum_{(i,j) \in \Omega} \log \underbrace{P(a_i, a_j \mid \mathcal{X})}_{\text{label pair posterior}} \Big]$   (3)

  $P(a_i, a_j \mid \mathcal{X}) \propto P(\mathcal{X} \mid a_i, a_j)\, P(a_i, a_j)$   (4)

where $\Omega$ is the set of selected label pairs; $a_i$ and $a_j$ are the labels of a pair; $\lambda$ is a trade-off between the label posterior and the label pair posterior. The likelihoods $P(\mathcal{X} \mid a_i)$ and $P(\mathcal{X} \mid a_i, a_j)$ are estimated using an 8-kernel GMM from training data (a training/scoring sketch follows).
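A minimal sketch of both GMM-based annotators, under the paper's uniform-prior assumption (posteriors then reduce to likelihoods). scikit-learn's GaussianMixture stands in for the 8-kernel GMMs; maximizing the collective objective of Eq. (3) requires a search over candidate annotation vectors (not shown), and lam=0.5 is only a placeholder value.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(vectors):
    """8-kernel GMM over pooled beat-level feature vectors."""
    return GaussianMixture(n_components=8, covariance_type='diag',
                           random_state=0).fit(vectors)

def loglik(gmm, bag):
    # log P(X | model) for a bag of vectors, assuming frame independence.
    return gmm.score_samples(bag).sum()

def annotate_individual(bag, label_gmms):
    """Eq. (2) with uniform priors: pick the more likely state per label.
    label_gmms[i][s] is the GMM for label i in state s (+1 / -1)."""
    return tuple(max((1, -1), key=lambda s: loglik(g[s], bag))
                 for g in label_gmms)

def collective_score(bag, a, label_gmms, pair_gmms, lam=0.5):
    """Objective of Eq. (3): single-label terms plus lambda-weighted
    label-pair terms over the selected pairs (the keys of pair_gmms)."""
    score = sum(loglik(label_gmms[i][a[i]], bag) for i in range(len(a)))
    score += lam * sum(loglik(pair_gmms[i, j][a[i], a[j]], bag)
                       for (i, j) in pair_gmms)
    return score
```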
2) Collective CRF-based method:
A Conditional Random Field (CRF) is an undirected graphical model; here, nodes are label variables and edges are relations between labels.
Multi-label classification CRF model [2]:

  $P(\mathbf{a} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \underbrace{\sum_{i} \sum_{k} \lambda_k f_k(a_i, \mathbf{x})}_{\text{overall potential of nodes}} + \underbrace{\sum_{(i,j) \in \Omega} \sum_{l} \mu_l g_l(a_i, a_j, \mathbf{x})}_{\text{overall potential of edges}} \Big)$   (5)

where
• $\mathbf{x}$: a sample (a song), represented by an input feature vector;
• $\mathbf{a}$: an output label vector;
• $Z(\mathbf{x})$: the normalizing factor;
• $f_k$ & $g_l$: features of the CRF, predefined real-valued functions;
• $\lambda_k$ & $\mu_l$: parameters to be estimated using training data.

Note: different from the GMM-based method, a "bag of features" cannot be used here; instead, each song is represented by a single 115-d feature vector:
115-d = 65-d (mean of beat-level features) + 50-d (word likelihoods).
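The sketch below only shows how the two potential families of Eq. (5) combine into an unnormalized log-score for one candidate assignment, using simple linear feature functions as placeholders for $f_k$ and $g_l$; parameter estimation and inference (as in [2]) are omitted. Dropping the edge term yields the individual CRF baseline compared below.

```python
import numpy as np

def crf_log_score(x, a, node_w, edge_w, pairs):
    """Unnormalized log P(a | x) of Eq. (5) for one song.
    x:      the 115-d song-level feature vector;
    a:      candidate label vector with entries in {+1, -1};
    node_w: weight vector per (label, state)      -- stands in for lambda_k f_k;
    edge_w: weight vector per (pair, joint state) -- stands in for mu_l g_l;
    pairs:  the selected label pairs (the edges of the CRF)."""
    score = sum(np.dot(node_w[i, a[i]], x) for i in range(len(a)))         # nodes
    score += sum(np.dot(edge_w[i, j, a[i], a[j]], x) for (i, j) in pairs)  # edges
    return score  # subtracting log Z(x) would give the true log-posterior
```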
Experiments
Data set:
• ~5,000 Western popular songs;
• manually annotated with semantic labels from the vocabulary in Table 1, according to the label number limitations;
• 25% for training, 75% for testing;
• 49 label pairs, those with NormMI > 0.1, are selected to be modeled (Table 2).

Compared methods:
1. Collective GMM-based method
2. Individual GMM-based method
3. Collective CRF-based method
4. Individual CRF-based method: uses the CRF framework of Eq. (5) without the "overall potential of edges".

Results:
• Per category performance: the performance for each category (Table 3).
  1. CRF-based methods outperform GMM-based methods;
  2. Collective annotation methods slightly but consistently improve on their individual counterparts, for both the GMM-based and the CRF-based pair.
• Per song performance: the average performance for a song (Table 4).
  1. While the recalls are similar, precision improves significantly from the generative models to the discriminative models;
  2. The collective methods slightly outperform their individual counterparts.
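For reference, a sketch of how per-song precision and recall can be computed from an annotation vector; the paper's exact metric definitions may differ slightly.

```python
import numpy as np

def precision_recall(pred, truth):
    """Precision/recall of one song's predicted label set (+1 / -1 entries)."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    tp = np.sum((pred == 1) & (truth == 1))
    precision = tp / max(np.sum(pred == 1), 1)
    recall = tp / max(np.sum(truth == 1), 1)
    return precision, recall

# Per song performance: average precision/recall over all test songs.
# Per category performance: the same computation restricted to the
# labels of one semantic category.
```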
Open question:
• The performance improvement from individual modeling to collective modeling is not large.
Possible reason:
In the individual modeling methods, labels that are "correlated" share many songs in their training sets (since each song has multiple labels). This makes the trained models of "correlated" labels themselves "correlated"; in other words, the correlation is already implicitly modeled.
Future Work
1. Explore better methods to model label correlations.
2. Develop better features, especially the song-level feature vector for the CRF-based methods.
3. Apply the obtained annotations in various applications, such as music similarity measurement, music search, and recommendation.
References
[1] Lu, L., Liu, D., and Zhang, H.-J., "Automatic mood detection and tracking of music audio signals," IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 5-18, 2006.
[2] Ghamrawi, N. and McCallum, A., "Collective multi-label classification," in Proc. of the 14th ACM International Conference on Information and Knowledge Management (CIKM), 2005, pp. 195-200.