Collective Annotation of Music from Multiple Semantic Categories

Zhiyao Duan (1,2), Lie Lu (1), and Changshui Zhang (2)
1. Microsoft Research Asia (MSRA), Beijing, China.
2. Department of Automation, Tsinghua University, Beijing, China.

Summary

- Two collective semantic annotation methods for music, modeling not only individual labels but also label correlations.
- 50 musically relevant labels are manually selected for music annotation, covering 10 aspects of music perception.
- Normalized mutual information is employed to measure the correlation between two semantic labels; label pairs with strong correlation are selected and modeled.
- Generative: Gaussian Mixture Model (GMM)-based method.
- Discriminative: Conditional Random Field (CRF)-based method.
- Experimental results show slight but consistent improvements over individual annotation methods.

Motivation

Semantic annotation of music is an important research direction. Semantic labels (text, words) are a more compact and efficient representation than raw audio or low-level features, and they potentially facilitate applications such as music retrieval and recommendation.

Disadvantages of previous methods:
- The vocabulary lacks structured labels, so the resulting annotations do not cover sufficient musical aspects.
- Only audio-label relations are modeled, without label-label relations, e.g. "hard rock" & "electric guitar", "happy" & "minor key".

Therefore, we divide the semantic vocabulary into categories and attempt to model label correlations.

Semantic Vocabulary

1. Consists of 50 labels, manually selected from web-parsed musically relevant words (Table 1).
2. 10 semantic categories (aspects).
3. A limit on the number of labels chosen from each category for annotation.
4. Normalized Mutual Information (NormMI), Eq. (1), is used to measure the correlation of each label pair; it is computed from the mutual information I(X; Y) and the entropies H(X) and H(Y). Properties: 0 <= NormMI(X; Y) <= 1; NormMI(X; Y) = 0 when X and Y are statistically independent; NormMI(X; X) = 1.
5. Only the label pairs whose NormMI values are larger than a threshold are selected to be modeled (Table 2: selected pairs; a sketch of this selection is given at the end of the document).

Audio Feature Extraction

A bag of beat-level feature vectors is used to represent a song:
1. Each song is divided into beat segments.
2. Each segment contains a number of frames of 20 ms length with 10 ms overlap.
3. Timbre features (94-d) and rhythm features (8-d) are extracted and composed into a 102-d feature vector for each segment.
4. PCA reduces the dimensionality to 65, preserving 95% of the energy.

Timbre features: means and standard deviations of 8-order MFCCs, spectral shape features and spectral contrast features.
Rhythm features: average tempo, average onset frequency, rhythm regularity, rhythm contrast, rhythm strength, and average drum frequency, amplitude and confidence [1].

Semantic Annotation

Problem: find semantic words to describe a song. This can be viewed as a multi-label binary classification problem.
- Input: a vocabulary of labels (words), and a bag of feature vectors of a song.
- Output: an annotation vector, in which each entry is a binary variable for the corresponding label (1: presence, -1: absence).
Solution: Maximum A Posteriori (MAP) estimation of the annotation vector.

Previous methods treat the labels as independent. In the individual GMM-based method, Eq. (2), the posterior of each label is computed separately, and the likelihood of each label is estimated with a kernel GMM learned from the training data.
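To make the individual decision concrete, the sketch below performs a per-label MAP decision. It assumes one GMM for label presence and one for absence (8 kernels each), a uniform prior, and a sum of per-beat log-likelihoods over the song's bag of beat-level features; these assumptions, the omitted per-category label limits, and all function names are hypothetical rather than the poster's exact Eq. (2).

```python
# Hypothetical sketch of an individual (per-label) GMM annotation decision.
# Assumed, not taken from the poster: presence/absence GMMs with 8 kernels,
# uniform priors, and summed per-beat log-likelihoods for a song.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_label_gmms(songs, labels, n_kernels=8):
    """songs: list of (n_beats x 65) feature arrays; labels: list of +1/-1 values."""
    pos = np.vstack([s for s, y in zip(songs, labels) if y == 1])
    neg = np.vstack([s for s, y in zip(songs, labels) if y == -1])
    gmm_pos = GaussianMixture(n_components=n_kernels).fit(pos)
    gmm_neg = GaussianMixture(n_components=n_kernels).fit(neg)
    return gmm_pos, gmm_neg

def annotate_label(song, gmm_pos, gmm_neg):
    """Return +1 (presence) or -1 (absence) for one label: MAP with a uniform prior."""
    ll_pos = gmm_pos.score_samples(song).sum()   # log P(X | w = +1)
    ll_neg = gmm_neg.score_samples(song).sum()   # log P(X | w = -1)
    return 1 if ll_pos > ll_neg else -1
```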
Proposed methods consider the relations between labels.

1) Collective GMM-based method: approximates the posterior of the whole annotation vector, Eq. (3), by combining the single-label posteriors with the label-pair posteriors of the selected pairs, where S is the set of selected label pairs, (w_i, w_j) are the labels of a pair, and a trade-off parameter balances the label posterior against the label-pair posterior. The likelihoods can be estimated from the training data using 8-kernel GMMs, and the prior probabilities can be set to a uniform distribution.

2) Collective CRF-based method: a Conditional Random Field (CRF) is an undirected graphical model whose nodes are the label variables and whose edges are the relations between labels. The multi-label classification CRF model [2], Eq. (5), expresses the conditional probability of the label vector through an overall potential of the nodes and an overall potential of the edges, where
- x: a sample (a song), represented by an input feature vector;
- y: an output label vector;
- Z(x): the normalizing factor;
- f & g: features of the CRF, predefined real-valued functions;
- λ & μ: parameters to be estimated from the training data.
Note: unlike the GMM-based method, a "bag of features" cannot be used here; instead, each song is represented by a 115-d feature vector: 115-d = 65-d (mean of the beat-level features) + 50-d (word likelihoods).

Experiments

Data set:
- ~5,000 Western popular songs;
- manually annotated with semantic labels from the vocabulary in Table 1, according to the label number limitations;
- 25% for training, 75% for testing;
- 49 label pairs, those with NormMI > 0.1, are selected to be modeled.

Compared methods:
1. Collective GMM-based method
2. Individual GMM-based method
3. Collective CRF-based method
4. Individual CRF-based method: uses the CRF framework of Eq. (5) without the "overall potential of edges".

Results:

Per-category performance (Table 3): the performance for each category.
1. The CRF-based methods outperform the GMM-based methods.
2. The collective annotation methods slightly but consistently improve on their individual counterparts, for both the GMM-based and the CRF-based methods.

Per-song performance (Table 4): the average performance for a song.
1. While the recalls are similar, precision improves significantly from the generative models to the discriminative models.
2. The collective methods slightly outperform their individual counterparts.

Open question: the performance improvement from individual modeling to collective modeling is not large. A possible reason is that in the individual methods, "correlated" labels share many songs in their training sets (since each song has multiple labels), so the trained models of "correlated" labels are also "correlated"; in other words, the correlation is already modeled implicitly.

Future Work

1. Further exploit better methods to model label correlations.
2. Exploit better features, especially the song-level feature vector for the CRF-based methods.
3. Apply the obtained annotations in various applications, such as music similarity measurement, music search and recommendation.

References

[1] Lu, L., Liu, D. and Zhang, H.J., "Automatic mood detection and tracking of music audio signals," IEEE Trans. on Audio, Speech and Language Processing, vol. 14, no. 1, pp. 5-18, 2006.
[2] Ghamrawi, N. and McCallum, A., "Collective multi-label classification," in Proc. of the 14th ACM International Conference on Information and Knowledge Management (CIKM), 2005, pp. 195-200.
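Appendix: this is the pair-selection sketch referenced under Semantic Vocabulary. It estimates a normalized mutual information between two binary label columns of the training annotations and keeps the pairs above the 0.1 threshold used in the experiments. The normalization I(X; Y) / sqrt(H(X) H(Y)) is one common choice that satisfies the properties listed for NormMI, not necessarily the exact form of Eq. (1), and all names are hypothetical.

```python
# Hypothetical sketch of NormMI-based label-pair selection.
# Assumption: NormMI(X; Y) = I(X; Y) / sqrt(H(X) * H(Y)), which satisfies
# 0 <= NormMI <= 1, NormMI = 0 for independent labels, and NormMI(X; X) = 1.
import numpy as np

def entropy(p):
    """Entropy (bits) of a discrete distribution given as an array of probabilities."""
    return -sum(pi * np.log2(pi) for pi in p if pi > 0)

def norm_mi(x, y):
    """x, y: 0/1 indicator vectors of two labels over the training songs."""
    x, y = np.asarray(x), np.asarray(y)
    # Empirical joint distribution over the four (x, y) outcomes.
    p_xy = np.array([[np.mean((x == a) & (y == b)) for b in (0, 1)] for a in (0, 1)])
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    mi = sum(p_xy[a, b] * np.log2(p_xy[a, b] / (p_x[a] * p_y[b]))
             for a in (0, 1) for b in (0, 1) if p_xy[a, b] > 0)
    h_x, h_y = entropy(p_x), entropy(p_y)
    return mi / np.sqrt(h_x * h_y) if h_x > 0 and h_y > 0 else 0.0

def select_pairs(label_matrix, threshold=0.1):
    """label_matrix: (songs x 50 labels) 0/1 matrix; keep pairs with NormMI > threshold."""
    n = label_matrix.shape[1]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if norm_mi(label_matrix[:, i], label_matrix[:, j]) > threshold]
```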