Jordan Smith
MUMT 611: Music Information Acquisition, Preservation, and Retrieval
Professor Ichiro Fujinaga
30 March 2008
Text Categorization and the Analysis of Lyrics
1. Introduction
The Music Information Retrieval (MIR) and Text Categorization (TC) communities are
closely related: they research similar problems, such as automatic classification and similarity
estimation; and they use similar techniques to solve them—mainly machine learning (ML)
techniques (Sebastiani 2002). They have a shared history, too: in the 1990s, as computing power
increased and a growing number of documents (both music and text) became available in digital
form, MIR and TC research expanded to address the need to handle these vast quantities of data.
The Music Information Retrieval Evaluation eXchange (MIREX) even took its text
counterpart—the Text REtrieval Conference (TREC)—as its pattern (Pienimäki 2006).
Ironically, despite their shared concerns and techniques, one topic that lies at the intersection of
these fields has remained largely unexplored: lyrics (Maxwell 2007).
This is all the more surprising given the finding that close to 30% of MIR queries use
lyrics data (Bainbridge et al. 2003). Compared to audio files, lyrics can be extremely easy to
collect: several studies have established tools for the automated collection and cross-referencing
of lyric data from online sources (Geleijnse and Korst 2006; Knees et al. 2005). Unlike most
audio data, lyrics are also compact, and may be collected and distributed freely and legally.
Lyrics are also very reliable ground truth: they are usually a highly accurate transcription of what
is uttered in a song, while a MIDI file may be a poor representation of what is played in a song
(Logan et al. 2004). Lyrics thus represent a rich and accessible source of data that ought to be
studied with some combination of MIR and TC techniques.
This paper provides an overview of text categorization (presuming a knowledge of
machine learning). The next section begins with a definition of the field, followed by a
discussion of various indexing and feature reduction techniques. Section 3 summarizes some
previous applications of TC, including studies of lyrics. Section 4 summarizes and concludes the
paper.
2. Text categorization
Text categorization is the task of assigning text documents within a corpus to one or more
category labels. In formal terms, for each pair (dj, ci) of a document dj and category ci, we must
assign some Boolean value to signify membership or non-membership (or else, a confidence
value of membership) (Sebastiani 2002). To apply machine learning to this problem requires
three steps: first, it is necessary to give the document dj some intermediate representation,
typically a vector of feature weights dj = {wj1, wj2, …, wj|F|} where each weight wji estimates the
importance of some element in a set F of features. Secondly, a classifier is trained on the
transformed data, and thirdly the classifier is evaluated. Techniques used for these latter steps are
not special to TC, but techniques for feature representation are. This section, which closely
follows (Sebastiani 2002), first discusses different methods of turning a document into a feature
vector (a process called “indexing”), and then different methods of eliminating extraneous or
unhelpful features.
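The three steps can be sketched end to end. The binary weighting and the nearest-centroid learner below are illustrative choices only; the literature reviewed here does not prescribe a particular representation or classifier:

```python
def to_vector(doc, features):
    """Step 1: index a tokenized document as a binary weight vector,
    w_ji = 1 if feature i occurs in document j, else 0."""
    return [1.0 if f in doc else 0.0 for f in features]

def train_centroids(docs, labels, features):
    """Step 2: 'train' by averaging the vectors in each category."""
    sums, counts = {}, {}
    for doc, label in zip(docs, labels):
        s = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(to_vector(doc, features)):
            s[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {c: [x / counts[c] for x in s] for c, s in sums.items()}

def classify(doc, centroids, features):
    """Step 3: assign the category whose centroid is nearest."""
    v = to_vector(doc, features)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist(v, centroids[c]))

# Invented toy data: two one-song "genres".
features = ["love", "baby", "death", "rain"]
centroids = train_centroids([["love", "baby"], ["death", "rain"]],
                            ["pop", "metal"], features)
classify(["love"], centroids, features)  # → "pop"
```

Evaluation (the third step) would proceed as in any ML setting, e.g. by measuring accuracy on held-out documents.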
2.1 Indexing techniques
The simplest method of indexing is the “bag of words” approach, in which each feature is
associated with a different word in the corpus. The weighting can be binary (1 if the word is
present in the document, and 0 otherwise), or a weighting function can be used to balance how
common the word is in the document against how common it is across the corpus. The most
common such function is the term frequency-inverse document frequency (tf-idf), defined as
follows:
tf-idf(fk, dj) = #(fk, dj) · log [ |D| / #D(fk) ]
where #(fk, dj) is the frequency of the feature fk in the document dj, and #D(fk) is the number of
documents in the corpus D that include fk.
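Under the bag-of-words model, this weighting can be computed directly. A minimal sketch, with an invented toy corpus:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Weight each feature of each tokenized document by
    tf-idf(f, d) = #(f, d) * log(|D| / #D(f))."""
    n_docs = len(corpus)
    doc_freq = Counter()                 # #D(f): documents containing f
    for doc in corpus:
        doc_freq.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)                # #(f, d): occurrences of f in d
        weights.append({f: tf[f] * math.log(n_docs / doc_freq[f])
                        for f in tf})
    return weights

corpus = [["love", "me", "do"], ["love", "love", "you"], ["let", "it", "be"]]
w = tf_idf(corpus)
# "love" occurs in two of the three documents, so its idf factor is
# log(3/2); a word occurring in every document would receive weight 0.
```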
Instead of words, the features fk may be associated with phrases, either taking n-grams
(strings of n words) or using a grammatically-informed definition of phrase. However, using
phrases as features has so far led to mixed, unpromising results (Sebastiani 2002).
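Extracting word n-grams is at least straightforward, whatever their classification value; a sketch, assuming tokenized input:

```python
def word_ngrams(tokens, n):
    """Return the n-grams (strings of n consecutive words) of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

word_ngrams(["yellow", "submarine", "yellow", "submarine"], 2)
# → ["yellow submarine", "submarine yellow", "yellow submarine"]
```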
Outside the words themselves, other aspects of the document may be measured and used
to train a classifier. Such meta-features include document length, the average length of words,
the generality of the categories on the training set, and so forth. Knowledge of phonetics and
grammatical structure also permits advanced features to be developed for specific classification
tasks: for instance, an estimate of the frequency of rhymes and of metric regularity, for detection
of poetry (Tizhoosh and Dara 2006); and the frequency of declarative or interrogative sentence
structures, for style recognition (Uzuner and Katz 2005).
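The simpler meta-features are easy to compute; a sketch (the rhyme and syntax features cited above additionally require phonetic dictionaries or parsers, which are omitted here):

```python
def meta_features(tokens):
    """A few document-level meta-features for a tokenized document."""
    n = len(tokens)
    return {
        "doc_length": n,
        "avg_word_length": sum(len(t) for t in tokens) / n if n else 0.0,
        # Ratio of distinct words to total words: a rough repetitiveness cue.
        "type_token_ratio": len(set(tokens)) / n if n else 0.0,
    }
```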
2.2 Feature reduction techniques
In a modest-sized corpus the number of features can easily climb into the tens of
thousands; the full text of Hamlet alone contains over 4000 unique words. For many classifiers,
this is prohibitively many features to train on. Fortunately, linguistics and statistics provide
methods of quickly eliminating those features that are uninteresting or that provide no
classification value.
One general strategy is to ignore those words that occur extremely frequently or
extremely infrequently, since these will have little value in discriminating between documents.
By removing words that occur just once or twice in a document (or that occur in only one or
two documents), as many as 90% of the features may be eliminated with no decrease in classifier
effectiveness. In addition, a list of stop words may be imposed, eliminating function words such
as ‘the’ and ‘to’ which are believed to contribute no meaning to a document (although these may
be relevant for a task such as style detection).
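Both pruning steps can be combined in a few lines; a sketch, with a deliberately tiny stop list (real systems use lists of a few hundred function words):

```python
from collections import Counter

# An illustrative stop list, not an exhaustive one.
STOP_WORDS = {"the", "to", "a", "an", "and", "of"}

def prune_features(corpus, min_df=2):
    """Keep features that occur in at least min_df documents and are not
    stop words; rare terms and function words are discarded."""
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    return {f for f, df in doc_freq.items()
            if df >= min_df and f not in STOP_WORDS}
```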
Using the training data, more advanced feature reduction is possible. For instance, using a
chi-square test or some other statistical measure, a word may be found to be frequent in the
training set but independent from the category labels. Such a word would have little
discriminative value and can be safely ignored.
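Concretely, the chi-square statistic for a single term against a binary category can be computed from a 2x2 contingency table. This is the standard formulation, not one specific to any study cited here:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/category contingency table:
    n11 = documents in the category containing the term, n10 = documents
    in the category without it, n01/n00 likewise for documents outside
    the category. A value near 0 suggests the term is independent of
    the label and may be dropped."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0
```

A term distributed evenly across categories scores 0; one perfectly aligned with a category scores the corpus size.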
Latent Semantic Analysis (LSA) is a technique that aims to account for synonymous
words. It involves projecting the feature set onto a much smaller set of “topics” (ideally
mapping synonyms into the same topic) while preserving as much of the corpus’s structure as
possible. The number of topics can be made arbitrarily small: a recent study on the lyrics of hit
songs used just 8 topics (Dhanaraj and Logan
2005).
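The core of LSA is a truncated singular value decomposition of the term-by-document matrix; a minimal sketch with NumPy (real systems would apply a tf-idf weighting first, and the tiny matrix here is invented):

```python
import numpy as np

def lsa(term_doc, k):
    """Project a term-by-document matrix onto k latent 'topics'
    via a truncated SVD."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Each column is a document, now described by k topics instead of
    # one weight per term.
    return np.diag(s[:k]) @ vt[:k, :]

X = np.array([[1.0, 1.0, 0.0],   # "sad"   occurs in docs 1 and 2
              [0.0, 1.0, 1.0],   # "blue"  occurs in docs 2 and 3
              [1.0, 0.0, 0.0]])  # "tears" occurs in doc 1 only
topics = lsa(X, 2)
# topics.shape == (2, 3): three documents described by two topics.
```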
3. Text categorization applications
TC was originally developed to address the task of automatic indexing: given a
set of subject words, the task is to tag each document with those subjects that are relevant
(Sebastiani 2002). In the 1990s, the CONSTRUE TC system was developed for the news agency
Reuters to automatically classify news stories. And today, we rely on TC to filter spam from
authentic email.
The feature extraction techniques described in section 2 have been applied to MIR in
order to analyze lyrics. Analogous to classifying news stories by topic, various studies have
attempted to classify music by genre. One found that while a genre classifier trained on lyric data
performed worse than a classifier trained on audio data, the two made different mistakes, suggesting
that lyric and audio data could be profitably combined (Logan et al. 2004). In a later study on
predicting hit songs, lyrics were found to provide more discriminating data than audio (Dhanaraj
and Logan 2005). Another recent study has indicated that a range of lyric features—relating to
grammar, word usage, sentiment, and repetition—may be combined to distinguish between rock,
pop, indie, and related genres that are notoriously difficult to separate using only audio data
(Maxwell 2007).
Lyric and audio data were directly combined in a study on lyric recognition from a
singing voice (Hosoya et al. 2005). Identifying consonants directly from audio can be too
difficult, so in this study only formants were estimated. Vowel patterns were then matched to
words; the task is made easier than it sounds since the only possible matching strings are those in
the database of lyrics. The retrieval task was further simplified by assuming the singer would
only begin at or near the beginning of a phrase (Suzuki et al. 2007).
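A toy version of this matching scheme can be sketched in a few lines. Here letters stand in for the formant-based vowel estimates, the phrase-beginning constraint is modeled as a prefix match, and the song database is invented:

```python
def vowel_pattern(text):
    """Reduce a lyric line to its vowel sequence, discarding consonants
    (a crude textual stand-in for formant-based vowel estimation)."""
    return "".join(c for c in text.lower() if c in "aeiou")

def match_lyric(query_vowels, lyric_db):
    """Return the phrases whose opening vowel sequence matches the query;
    restricting matches to phrase beginnings mirrors the simplification
    of Suzuki et al. (2007)."""
    return [line for line in lyric_db
            if vowel_pattern(line).startswith(query_vowels)]

db = ["hey jude", "let it be", "ticket to ride"]
match_lyric(vowel_pattern("hey ju"), db)  # → ["hey jude"]
```

Because only strings in the database can match, the vowel sequence alone is often enough to narrow the candidates to a single phrase.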
One problem faced when applying TC to lyrics is that TC techniques often rely on large
sample sets. But where a single text document may contain several thousand words, pop songs
may average only 200 words, limiting the effectiveness of the “bag of words” approach
(Maxwell 2007). New TC techniques are thus currently being developed to tackle lyrics more
effectively. Since a corpus of lyrics may be too small to capture the full set of semantic
relationships in English, one study trained its feature extractor on a thesaurus before
applying it to the lyrics, improving its results (Wei et al. 2007).
4. Conclusion
This review has summarized those aspects of TC relevant to the MIR community,
describing the main methods of extracting and selecting features. Significant applications of TC
were named, followed by a review of current research applying TC to MIR tasks. Overall,
analyses of lyrics have proved very promising. However, current techniques have encountered
challenging obstacles and future research will likely need to develop a new stable of TC
techniques to apply to lyrics.
References
Bainbridge, D., S. Cunningham, and J. Downie. 2003. Analysis of queries to a Wizard-of-Oz
MIR system: Challenging assumptions about what people really want. Proceedings of
IEEE International Conference on Multimedia and Expo.
Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference
on Music Information Retrieval, London, UK. 488–91.
Geleijnse, G., and J. Korst. 2006. Efficient lyrics extraction from the web. International
Conference on Music Information Retrieval, Victoria, Canada. 371–2.
Hosoya, T., M. Suzuki, A. Ito, and S. Makino. 2005. Lyrics recognition from a singing voice
based on finite state automaton for music information retrieval. International Conference
on Music Information Retrieval, London, UK. 532–5.
Logan, B., A. Kositsky, and P. Moreno. 2004. Semantic analysis of song lyrics. Proceedings of
IEEE International Conference on Multimedia and Expo. 1–7.
Maxwell, T. 2007. Exploring the music genome: Lyric clustering with heterogeneous features.
M.Sc. Thesis. University of Edinburgh.
Pienimäki, A. 2006. Organised evaluation in (music) information retrieval: TREC and MIREX.
Paper presented to the seminar Information Retrieval Research, Nov. 16, University of
Helsinki. <http://www.cs.helsinki.fi/u/linden/teaching/irr06/papers/ap_irr06_final.pdf>
Sebastiani, F. 2002. Machine learning in automated text categorization. Technical Report,
Consiglio Nazionale delle Ricerche. Pisa, Italy. 1–59.
Suzuki, M., T. Hosoya, A. Ito, and S. Makino. 2007. Music information retrieval from a singing
voice using lyrics and melody information. EURASIP Journal on Advances in Signal
Processing. Volume 2007, Article ID 38727.
Tizhoosh, H., and R. Dara. 2006. On poem recognition. Pattern Analysis and Applications.
9(4):325–38.
Uzuner, Ö., and B. Katz. 2005. Style versus expression in literary narratives. Proceedings of the
28th Annual International ACM SIGIR Conference. Salvador, Brazil.
Wei, B., C. Zhang, and M. Ogihara. 2007. Keyword generation for lyrics. International
Conference on Music Information Retrieval, Vienna, Austria. 121–2.