An Overview of Text Mining
Rebecca Hwa
4/25/2002

References
M. Hearst, “Untangling Text Data Mining,” in the Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999.
E. Riloff and R. Jones, “Learning Dictionaries for Information Extraction Using Multi-level Bootstrapping,” in the Proceedings of AAAI-99, 1999.
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, “Text Classification from Labeled and Unlabeled Documents using EM,” in Machine Learning, 2000.
M. Grobelnik, D. Mladenic, and N. Milic-Frayling, “Text Mining as Integration of Several Related Research Areas: Report on KDD’2000 Workshop on Text Mining,” 2000.

What Is Text Mining?
“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)
“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)

Text Mining
• How does it relate to data mining in general?
• How does it relate to computational linguistics?
• How does it relate to information retrieval?

• Finding Patterns
  – Non-textual data: General data mining
  – Textual data: Computational linguistics
• Finding “Nuggets”
  – Novel: Exploratory data analysis
  – Non-novel: Database queries (non-textual data); Information retrieval (textual data)

Challenges in Text Mining
• Data collection is “free text”
  – Data is not well-organized
    • Semi-structured or unstructured
  – Natural language text contains ambiguities on many levels
    • Lexical, syntactic, semantic, and pragmatic
  – Learning techniques for processing text typically need annotated training examples
    • Consider bootstrapping techniques

Text Mining Tasks
• Exploratory Data Analysis
  – Using text to form hypotheses about diseases (Swanson and Smalheiser, 1997).
• Information Extraction
  – (Semi)automatically create (domain-specific) knowledge bases, and then use standard data-mining techniques.
    • Bootstrapping methods (Riloff and Jones, 1999).
• Text Classification
  – Useful intermediary step for information extraction
    • Bootstrapping method using EM (Nigam et al., 2000).

Biomedical Data Exploration (Swanson and Smalheiser, 1997)
• Extract pieces of evidence from article titles in the biomedical literature
  – “stress is associated with migraines”
  – “stress can lead to loss of magnesium”
  – “calcium channel blockers prevent some migraines”
  – “magnesium is a natural calcium channel blocker”
• Induce a new hypothesis not in the literature by combining culled text fragments with human medical expertise
  – Magnesium deficiency may play a role in some kinds of migraine headache

Challenges in Data Exploration
• How can valid inference links be found without succumbing to a combinatorial explosion of possibilities?
  – Need better models of lexical relationships and semantic constraints (very hard)
• How should the information be presented to the human experts to facilitate their exploration?

Information Extraction (IE)
• Extract domain-specific information from natural language text
  – Need a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”)
    • Constructed by hand
    • Automatically learned from hand-annotated training data
  – Need a semantic lexicon (dictionary of words with semantic category labels)
    • Typically constructed by hand

Challenges in IE
• Automatic learning methods are typically supervised (i.e., need labeled examples)
• But annotating training data is a time-consuming and expensive task.
• Can we develop better unsupervised algorithms?
• Can we make better use of a small set of labeled examples?
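
The extraction patterns mentioned on the IE slide (e.g., “traveled to <x>”) can be illustrated with a minimal sketch. The regular-expression encoding, the names `PATTERNS`, `compile_pattern`, and `extract`, and the rule that a slot filler is a run of capitalized words are all assumptions made for illustration; they are not the pattern representation used by the IE systems cited in these slides.

```python
import re

# Hypothetical encoding: each pattern string marks its extraction slot with <x>.
PATTERNS = ["traveled to <x>", "presidents of <x>"]


def compile_pattern(pattern):
    """Turn a pattern such as 'traveled to <x>' into a regex with one capture group."""
    prefix = re.escape(pattern.replace("<x>", "").strip())
    # Assumption: the slot filler is a run of capitalized words,
    # a crude stand-in for a real noun-phrase chunker.
    return re.compile(prefix + r"\s+((?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*)")


def extract(text, patterns=PATTERNS):
    """Return (pattern, filler) pairs for every place a pattern fires in the text."""
    results = []
    for pattern in patterns:
        for match in compile_pattern(pattern).finditer(text):
            results.append((pattern, match.group(1)))
    return results
```

Calling `extract` on a sentence returns one pair per match, so the fillers can be collected per pattern and checked against a semantic lexicon. A hand-built dictionary of such patterns is exactly the resource that the slides describe as expensive to construct.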
Learning Dictionaries for IE via Bootstrapping (Riloff and Jones, 1999)
• Simultaneously learn extraction patterns and domain-specific semantic lexicons
• Input requires a small set of seed words (for the semantic categories) and a large collection of text
• Mutual bootstrapping
  – Learn extraction patterns from seed words
  – Use extraction patterns to identify new words to add to the semantic categories
  – Meta-bootstrapping to reduce noise

Text Classification (TC)
• Tag a document as belonging to one of a set of pre-defined classes
  – “This does not lead to discovery of new information…” (Hearst, 1999).
  – Many practical uses
    • Group documents into different domains (useful for domain-specific information extraction)
    • Learn reading interests of users
    • Automatically sort e-mail
    • On-line New Event Detection

Challenges in TC
• Like IE, TC also needs many labeled examples as training data
  – After a user had labeled 1000 UseNet news articles, the system was right only ~50% of the time at selecting articles interesting to the user.
• What other sources of information can reduce the need for labeled examples?

TC from Labeled and Unlabeled Documents using EM (Nigam et al., 2000)
• Expectation-Maximization
  – Iterative algorithm for maximum likelihood estimation (MLE) in parametric estimation problems with missing data (e.g., the labels of the examples)
• Nigam et al. combined the EM algorithm with a Naïve Bayes classifier, using both labeled and unlabeled data as input
  – Dynamically adjust the strength of the unlabeled data’s contribution to parameter estimation in EM
  – Reduce the bias of Naïve Bayes by modeling each class with multiple mixture components

Probabilistic Framework for TC
• Assumption #1: Each document is produced by a mixture model
  – Generate documents according to the probability distribution defined by the model parameters $\theta$
• Assumption #2: Each class is modeled by one mixture component: $C = \{c_1, \ldots, c_{|C|}\}$

The probability
of the model generating document $d_i$ is:

$$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)$$

Naïve Bayes Model
• Assumes words in the document are generated independently (no context)
• Assumes all documents have the same length

$$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)$$

$$P(d_i \mid c_j; \theta) = P(w_1, \ldots, w_{|d_i|} \mid c_j; \theta) = \prod_{k=1}^{|d_i|} P(w_k \mid c_j; \theta)$$

• Model parameters: $\theta = \{\theta_{w|c}, \theta_c\}$

Using a Trained Model
• What class should a new document $d$ be assigned to?

$$P(\mathrm{Label}(d) = c \mid d; \theta) = \frac{P(c \mid \theta)\, P(d \mid c; \theta)}{P(d \mid \theta)}$$

• Pick the class with the highest probability

Parameter Estimation with Labeled Documents
• Estimating model parameters: $\theta = \{\theta_{w|c}, \theta_c\}$

$$\theta_{w|c} \equiv P(w \mid c; \theta) = \frac{\sum_{d \in D} \#(w \in d)\, \mathrm{Ind}(\mathrm{Label}(d) = c)}{\sum_{w' \in V} \sum_{d' \in D} \#(w' \in d')\, \mathrm{Ind}(\mathrm{Label}(d') = c)}$$

$$\theta_c \equiv P(c \mid \theta) = \frac{\sum_{d \in D} \mathrm{Ind}(\mathrm{Label}(d) = c)}{|D|}$$

Parameter Estimation with Unlabeled Documents
• EM: for “incomplete data” problems
• Maximize the probability of the model generating the observed data
• Build an initial classifier (initialize the parameters to “reasonable” starting values)
• Repeat until convergence
  – E-Step: Use the current classifier parameters, $\theta_t$, to estimate $P(c \mid d; \theta_t)$ for all $d$ in $D_u$ (the unlabeled documents)
  – M-Step: Re-estimate the classifier, $\theta_{t+1}$, using the expected counts from the E-Step

Augmented EM
• Weight the unlabeled data
  – Otherwise, the unlabeled data overwhelms the small amount of labeled data
  – Modify the M-step to multiply the expected counts by a weight factor
• Relax the assumption of one mixture component per class
  – Allow labeled data to fall into “topics” within a class
  – Modify the E-step to allow labeled documents to probabilistically belong to sub-topics
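
The E-step/M-step loop and the weighting of unlabeled data described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Nigam et al.: the function names, the weight parameter `lam`, the fixed iteration count (used in place of a convergence test), and the add-one smoothing (added so that no probability is zero, which the estimates on the slides do not show) are all choices made here. The multiple-components-per-class extension is not covered.

```python
import math
from collections import Counter


def m_step(docs, posteriors, weights, classes, vocab):
    """Re-estimate theta = (theta_c, theta_w|c) from possibly fractional class memberships."""
    class_mass = {c: 0.0 for c in classes}
    word_counts = {c: Counter() for c in classes}
    for doc, post, weight in zip(docs, posteriors, weights):
        for c in classes:
            r = weight * post.get(c, 0.0)  # weighted expected membership of doc in c
            class_mass[c] += r
            for w in doc:                  # repeated tokens count repeatedly: #(w in d)
                word_counts[c][w] += r
    total = sum(class_mass.values())
    # Assumption: add-one smoothing on both parameter sets.
    prior = {c: (1 + class_mass[c]) / (len(classes) + total) for c in classes}
    cond = {}
    for c in classes:
        denom = len(vocab) + sum(word_counts[c].values())
        cond[c] = {w: (1 + word_counts[c][w]) / denom for w in vocab}
    return prior, cond


def e_step(doc, prior, cond):
    """Compute P(c | d; theta) with Bayes' rule in log space; unseen words are skipped."""
    logp = {c: math.log(prior[c])
               + sum(math.log(cond[c][w]) for w in doc if w in cond[c])
            for c in prior}
    top = max(logp.values())
    norm = sum(math.exp(v - top) for v in logp.values())
    return {c: math.exp(v - top) / norm for c, v in logp.items()}


def em_naive_bayes(labeled, unlabeled, classes, lam=0.5, iters=10):
    """labeled: list of (tokens, label); unlabeled: list of tokens; lam: unlabeled weight."""
    vocab = {w for d, _ in labeled for w in d} | {w for d in unlabeled for w in d}
    l_docs = [d for d, _ in labeled]
    l_post = [{y: 1.0} for _, y in labeled]  # labeled documents keep hard memberships
    l_wts = [1.0] * len(l_docs)
    # Initial classifier: trained on the labeled documents only.
    prior, cond = m_step(l_docs, l_post, l_wts, classes, vocab)
    for _ in range(iters):                   # simplification: fixed number of iterations
        u_post = [e_step(d, prior, cond) for d in unlabeled]           # E-step
        prior, cond = m_step(l_docs + unlabeled, l_post + u_post,      # M-step
                             l_wts + [lam] * len(unlabeled), classes, vocab)
    return prior, cond


def classify(doc, prior, cond):
    """Pick the class with the highest posterior probability."""
    post = e_step(doc, prior, cond)
    return max(post, key=post.get)
```

With `lam` below 1, each unlabeled document contributes less than a labeled one to the expected counts, which is the weighting idea from the Augmented EM slide. Setting `lam=0` reduces the procedure to ordinary supervised Naïve Bayes on the labeled documents.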