Jordan Smith
MUMT 611: Music Information Acquisition, Preservation, and Retrieval
Professor Ichiro Fujinaga
30 March 2008

Text Categorization and the Analysis of Lyrics

1. Introduction

The Music Information Retrieval (MIR) and Text Categorization (TC) communities are closely related: they research similar problems, such as automatic classification and similarity estimation, and they use similar techniques to solve them, mainly machine learning (ML) techniques (Sebastiani 2002). They also share a history: in the 1990s, as computing power increased and a growing number of documents (both music and text) became available in digital form, MIR and TC research expanded to address the need to handle these vast quantities of data. The Music Information Retrieval Evaluation eXchange (MIREX) even took its text counterpart, the Text REtrieval Conference (TREC), as its pattern (Pienimäki 2006).

Ironically, despite their shared concerns and techniques, one topic that lies at the intersection of these fields has remained largely unexplored: lyrics (Maxwell 2007). This is all the more surprising given the finding that close to 30% of MIR queries use lyrics data (Bainbridge et al. 2003). Compared to audio files, lyrics are extremely easy to collect: several studies have established tools for the automated collection and cross-referencing of lyric data from online sources (Geleijnse and Korst 2006; Knees et al. 2005). Unlike most audio data, lyrics are also compact, and may be collected and distributed freely and legally. Lyrics are also very reliable ground truth: they are usually a highly accurate transcription of what is uttered in a song, whereas a MIDI file may be a poor representation of what is played in a song (Logan et al. 2004). Lyrics thus represent a rich and accessible source of data that ought to be studied with some combination of MIR and TC techniques.

This paper provides an overview of text categorization, presuming a knowledge of machine learning. The next section begins with a definition of the field, followed by a discussion of various indexing and feature reduction techniques. Section 3 summarizes some previous applications of TC, including studies of lyrics. Section 4 summarizes and concludes the paper.

2. Text categorization

Text categorization is the task of assigning the text documents in a corpus to one or more category labels. In formal terms, for each pair (d_j, c_i) of a document d_j and a category c_i, we must assign a Boolean value signifying membership or non-membership (or else a confidence value of membership) (Sebastiani 2002). Applying machine learning to this problem requires three steps. First, each document d_j must be given some intermediate representation, typically a vector of feature weights d_j = (w_j1, w_j2, …, w_j|F|), where each weight w_jk estimates the importance of the k-th element of a set F of features. Second, a classifier is trained on the transformed data; third, the classifier is evaluated. The techniques used for these latter two steps are not special to TC, but the techniques for feature representation are. This section, which closely follows Sebastiani (2002), first discusses methods of turning a document into a feature vector (a process called "indexing"), and then methods of eliminating extraneous or unhelpful features.

2.1 Indexing techniques

The simplest method of indexing is the "bag of words" approach, in which each feature is associated with a different word in the corpus. The weighting can be binary (1 if the word is present in the document, and 0 otherwise), or a weighting function can be used to express how common the word is in the document relative to how common it is in the corpus. The most common such function is the term frequency/inverse document frequency (tf-idf), defined as follows:

    tf-idf(f_k, d_j) = #(f_k, d_j) · log( |D| / #_D(f_k) )

where #(f_k, d_j) is the frequency of the feature f_k in the document d_j, and #_D(f_k) is the number of documents in the corpus D that contain f_k.

Instead of words, the features f_k may be associated with phrases, either by taking n-grams (strings of n words) or by using a grammatically informed definition of phrase. However, using phrases as features has so far led to mixed, unpromising results (Sebastiani 2002).

Beyond the words themselves, other aspects of a document may be measured and used to train a classifier. Such meta-features include document length, the average length of words, the generality of the categories in the training set, and so forth. Knowledge of phonetics and grammatical structure also permits specialized features to be developed for particular classification tasks: for instance, estimates of the frequency of rhymes and of metric regularity, for the detection of poetry (Tizhoosh and Dara 2006); or the frequency of declarative and interrogative sentence structures, for style recognition (Uzuner and Katz 2005).
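To make the bag-of-words representation concrete, the following minimal Python sketch computes the tf-idf weights defined above for a toy corpus. The whitespace tokenizer and the example lyrics are placeholder assumptions chosen for illustration; they are not taken from any of the cited systems.

    import math
    from collections import Counter

    def tfidf_index(corpus):
        """Map each document (a string) to a dict of tf-idf weights, using
        tf-idf(f_k, d_j) = #(f_k, d_j) * log(|D| / #_D(f_k)) as defined above."""
        tokenized = [doc.lower().split() for doc in corpus]   # placeholder tokenizer
        n_docs = len(tokenized)
        doc_freq = Counter()                                  # #_D(f_k)
        for tokens in tokenized:
            doc_freq.update(set(tokens))
        vectors = []
        for tokens in tokenized:
            counts = Counter(tokens)                          # #(f_k, d_j)
            vectors.append({word: count * math.log(n_docs / doc_freq[word])
                            for word, count in counts.items()})
        return vectors

    # Toy corpus: each "document" is the lyric of one hypothetical song.
    lyrics = ["love me do love love me do",
              "all you need is love",
              "yesterday all my troubles seemed so far away"]
    for vector in tfidf_index(lyrics):
        print(vector)

Note that a word occurring in every document receives a weight of zero, since log(|D| / #_D(f_k)) = log 1 = 0, matching the intuition that such a word cannot help discriminate between documents.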
2.2 Feature reduction techniques

Even in a modest-sized corpus, the number of features can easily climb into the tens of thousands; the full text of Hamlet alone contains over 4000 unique words. For many classifiers, this is far too many features to train on. Fortunately, linguistics and statistics provide methods of quickly eliminating features that are uninteresting or that provide no classification value.

One general strategy is to ignore words that occur extremely frequently or extremely infrequently, since these have little value in discriminating between documents. By removing the words that occur just once or twice in a document (or that occur in only one or two documents), as many as 90% of the features may be eliminated with no decrease in classifier effectiveness. In addition, a list of stop words may be imposed, eliminating function words such as "the" and "to" that are believed to contribute no meaning to a document (although these may be relevant for a task such as style detection).

Using the training data, more advanced feature reduction is possible. For instance, a chi-square test or some other statistical measure may show that a word, though frequent in the training set, is independent of the category labels; such a word has little discriminative value and can be ignored. Latent Semantic Analysis (LSA) is a technique that aims to account for synonymous words: it projects the feature set onto a new, smaller set of "topics" (ideally mapping synonyms onto the same topic) in order to maximize classification ability. Arbitrarily few topics can be chosen; a recent study on the lyrics of hit songs used just 8 topics (Dhanaraj and Logan 2005).
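As an illustration of how these reduction steps can be chained together, here is a hedged sketch using scikit-learn. The library, the function reduce_features, and the thresholds (500 retained words, 8 topics) are assumptions made for this example rather than choices reported by any of the cited studies, and the LSA-style projection is approximated with a truncated singular value decomposition.

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    def reduce_features(docs, labels, n_words=500, n_topics=8):
        """Bag-of-words indexing followed by the reduction steps described above.
        docs: list of lyric strings; labels: one category label per document."""
        # Drop words that appear in fewer than two documents or in more than
        # 90% of documents, and remove a standard English stop word list.
        vectorizer = CountVectorizer(min_df=2, max_df=0.9, stop_words="english")
        counts = vectorizer.fit_transform(docs)

        # Keep only the words whose counts depend most strongly on the
        # category labels according to a chi-square test.
        selector = SelectKBest(chi2, k=min(n_words, counts.shape[1]))
        selected = selector.fit_transform(counts, labels)

        # Project the surviving features onto a handful of "topics" with a
        # truncated SVD, in the spirit of LSA.
        svd = TruncatedSVD(n_components=max(1, min(n_topics, selected.shape[1] - 1)))
        return svd.fit_transform(selected)

A classifier can then be trained on the returned topic weights rather than on the raw word counts.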
3. Text categorization applications

TC was originally developed to address the task of automatic indexing: given a set of subject words, the task is to tag each document with those subjects that are relevant (Sebastiani 2002). In the 1990s, the CONSTRUE TC system was developed for the news agency Reuters to classify news stories automatically, and today we rely on TC to filter spam from legitimate email.

The feature extraction techniques described in section 2 have also been applied to MIR in order to analyze lyrics. Analogous to classifying news stories by topic, various studies have attempted to classify music by genre. One found that although a genre classifier trained on lyric data performed worse than a classifier trained on audio data, the two made different mistakes, suggesting that lyric and audio data could be profitably combined (Logan et al. 2004). In a later study on predicting hit songs, lyrics were found to provide more discriminating data than audio (Dhanaraj and Logan 2005). Another recent study indicated that a range of lyric features (relating to grammar, word usage, sentiment, and repetition) may be combined to distinguish between rock, pop, indie, and related genres that are notoriously difficult to separate using audio data alone (Maxwell 2007).

Lyric and audio data were combined directly in a study on lyric recognition from a singing voice (Hosoya et al. 2005). Because identifying consonants directly from audio is too difficult, this study estimated only formants. Vowel patterns were then matched to words; the task is easier than it sounds, since the only possible matching strings are those in the database of lyrics. The retrieval task was further simplified by assuming the singer would begin only at or near the beginning of a phrase (Suzuki et al. 2007).

One problem faced when applying TC to lyrics is that TC techniques often rely on large samples of text. Where a single text document may contain several thousand words, pop songs may average only 200, limiting the effectiveness of the "bag of words" approach (Maxwell 2007). New TC techniques are thus being developed to tackle lyrics more effectively. Since a corpus of lyrics can be too small to teach a system the semantic relationships that exist in English, one study trained its feature extractor on a thesaurus before applying it to the lyrics, improving its results (Wei et al. 2007).

4. Conclusion

This review has summarized the aspects of TC most relevant to the MIR community, describing the main methods of extracting and selecting features. Significant applications of TC were named, followed by a review of current research applying TC to MIR tasks. Overall, analyses of lyrics have proved very promising. However, current techniques have encountered challenging obstacles, and future research will likely need to develop a new stable of TC techniques to apply to lyrics.

References

Bainbridge, D., S. Cunningham, and J. Downie. 2003. Analysis of queries to a Wizard-of-Oz MIR system: Challenging assumptions about what people really want. Proceedings of the IEEE International Conference on Multimedia and Expo.
Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London, UK. 488–91.
Geleijnse, G., and J. Korst. 2006. Efficient lyrics extraction from the web. International Conference on Music Information Retrieval, Victoria, Canada. 371–2.
Hosoya, T., M. Suzuki, A. Ito, and S. Makino. 2005. Lyrics recognition from a singing voice based on finite state automaton for music information retrieval. International Conference on Music Information Retrieval, London, UK. 532–5.
Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of the IEEE International Conference on Multimedia and Expo. 1–7.
Maxwell, T. 2007. Exploring the music genome: Lyric clustering with heterogeneous features. M.Sc. thesis, University of Edinburgh.
Pienimäki, A. 2006. Organised evaluation in (music) information retrieval: TREC and MIREX. Paper presented to the seminar Information Retrieval Research, Nov. 16, University of Helsinki. <http://www.cs.helsinki.fi/u/linden/teaching/irr06/papers/ap_irr06_final.pdf>
Sebastiani, F. 2002. Machine learning in automated text categorization. Technical Report, Consiglio Nazionale delle Ricerche, Pisa, Italy. 1–59.
Suzuki, M., T. Hosoya, A. Ito, and S. Makino. 2007. Music information retrieval from a singing voice using lyrics and melody information. EURASIP Journal on Advances in Signal Processing, Volume 2007, Article ID 38727.
Tizhoosh, H., and R. Dara. 2006. On poem recognition. Pattern Analysis and Applications 9(4): 325–38.
Uzuner, Ö., and B. Katz. 2005. Style versus expression in literary narratives. Proceedings of the 28th Annual International ACM SIGIR Conference, Salvador, Brazil.
Wei, B., C. Zhang, and M. Ogihara. 2007. Keyword generation for lyrics. International Conference on Music Information Retrieval, Vienna, Austria. 121–2.