Importance of Semantic Representation: Dataless Classification
Ming-Wei Chang, Lev Ratinov, Dan Roth, Vivek Srikumar
University of Illinois, Urbana-Champaign

Slide 1: Text Categorization
- Classify the following sentence: "Syd Millar was the chairman of the International Rugby Board in 2003."
- Pick a label: Class1 vs. Class2
- Traditionally, we need annotated data to train a classifier.

Slide 2: Text Categorization
- Humans don't seem to need labeled data.
- "Syd Millar was the chairman of the International Rugby Board in 2003."
- Pick a label: Sports vs. Finance
- Label names carry a lot of information!

Slide 3: Text Categorization
- Do we really always need labeled data?

Slide 4: Contributions
- We can often go quite far without annotated data ... if we "know" the meaning of text.
- This works for text categorization ... and is consistent across different domains.

Slide 5: Outline
- Semantic Representation
- On-the-fly Classification
- Datasets
- Exploiting unlabeled data
- Robustness to different domains

Slide 6: Outline (repeated; next: Semantic Representation)

Slide 7: Semantic Representation
- One common representation is the bag-of-words representation.
- All text is a vector in the space of words.

Slide 8: Semantic Representation
- Explicit Semantic Analysis (ESA) [Gabrilovich & Markovitch, 2006, 2007]
- Text is a vector in the space of concepts.
- Concepts are defined by Wikipedia articles.

Slide 9: Explicit Semantic Analysis: Example
- "Apple IPod" -> ESA representation (Wikipedia article titles): IPod mini, IPod photo, IPod nano, Apple Computer, IPod shuffle, ITunes
- "Monetary Policy" -> ESA representation (Wikipedia article titles): International Monetary Fund, Monetary policy, Economic and Monetary Union, Hong Kong Monetary Authority, Monetarism, Central bank

Slide 10: Semantic Representation
- Two semantic representations: bag of words and ESA.

Slide 11: Outline (repeated; next: On-the-fly Classification)

Slide 12: Traditional Text Categorization
- [Diagram: a labeled corpus of Sports and Finance documents is mapped into the semantic space and used to train a classifier.]

Slide 13: Dataless Classification
- [Diagram: the labeled corpus is replaced by just the labels: Sports, Finance.]
- What can we do using just the labels?

Slide 14: But labels are text too!

Slide 15: Dataless Classification
- [Diagram: a new unlabeled document and the labels (Sports, Finance) in the semantic space.]

Slide 16: What is Dataless Classification?
- Humans don't need training for classification.
- Annotated training data is not always needed.
- Look for the meaning of words.

Slide 17: What is Dataless Classification? (repeated from Slide 16)

Slide 18: On-the-fly Classification
- [Diagram: a new unlabeled document and the labels (Sports, Finance) in the semantic space.]

Slide 19: On-the-fly Classification
- No training data needed.
- We know the meaning of label names.
- Pick the label that is closest in meaning to the document (nearest neighbors).

Slide 20: On-the-fly Classification
- [Diagram: a new unlabeled document and new labels (Hockey, Baseball) in the semantic space.]

Slide 21: On-the-fly Classification
- No need to even know the labels beforehand.
- Compare with traditional classification, which needs annotated training data for each label.
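To make the nearest-label rule on Slides 19-21 concrete, here is a minimal sketch of on-the-fly classification: represent the document and each label in the same semantic space and pick the closest label by cosine similarity. The sketch uses a plain bag-of-words space; the function names, the short label descriptions, and the example are illustrative assumptions rather than the paper's exact setup, and the ESA variant would replace the word vectors with vectors over Wikipedia-concept dimensions.

```python
# A minimal sketch of on-the-fly (dataless) classification: the document and
# each label description live in the same space, and the nearest label wins.
# The space here is a plain bag of words; the ESA variant would map text to
# Wikipedia-concept vectors instead. All names below are illustrative.
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(c * v.get(w, 0) for w, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def classify_on_the_fly(document, label_descriptions):
    """Pick the label whose description is closest in meaning to the document.
    No annotated training data is used; each label acts as the single
    'example' of its class (a nearest-neighbor rule over labels)."""
    doc_vec = bow(document)
    return max(label_descriptions,
               key=lambda label: cosine(doc_vec, bow(label_descriptions[label])))

doc = "Syd Millar was the chairman of the International Rugby Board in 2003."
labels = {"Sports": "sports game team rugby football",
          "Finance": "finance money bank stock market"}
print(classify_on_the_fly(doc, labels))  # -> "Sports"
```

In this raw word space the bare label names "Sports" and "Finance" would share no words with the document, which is why the label descriptions above are padded with a few extra words; closing that lexical gap without hand-written descriptions is what the ESA concept space is for (compare the Bag of Words and ESA columns on Slide 26).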
Slide 22: Outline (repeated; next: Datasets)

Slide 23: Dataset 1: Twenty Newsgroups
- Posts to newsgroups.
- Newsgroups have descriptive names:
  sci.electronics = Science, Electronics
  rec.motorcycles = Motorcycles

Slide 24: Dataset 2: Yahoo! Answers
- Posts to Yahoo! Answers.
- Posts are categorized into a two-level hierarchy: 20 top-level categories and 280 categories in total at the second level.
- Examples: Arts and Humanities > Theater Acting; Sports > Rugby League.

Slide 25: Experiments
- 20 Newsgroups: 10 binary problems (from [Raina et al., '06]), e.g., Religion vs. Politics.guns, Motorcycles vs. MS Windows.
- Yahoo! Answers: 20 binary problems, e.g., Health > Diet Fitness vs. Health > Allergies, Consumer Electronics > DVRs vs. Pets > Rodents.

Slide 26: Results: On-the-fly Classification (accuracy, %)

  Dataset      Supervised Baseline   Bag of Words   ESA
  Newsgroups   71.7                  65.7           85.3
  Yahoo!       84.3                  66.8           88.6

- The supervised baseline is a Naïve Bayes classifier: it uses annotated data and ignores the label names.
- The Bag of Words and ESA columns are nearest-neighbor, on-the-fly classifiers: they use the labels and no annotated data.

Slide 27: Outline (repeated; next: Exploiting unlabeled data)

Slide 28: Using Unlabeled Data
- Knowing the data collection helps: we can learn specific biases of the dataset.
- Potential for semi-supervised learning.

Slide 29: Bootstrapping
- Each label name is a "labeled" document: one "example" in word or concept space.
- Train an initial classifier (the same as the on-the-fly classifier).
- Loop: classify all documents with the current classifier; retrain the classifier with the highly confident predictions (a minimal code sketch of this loop follows the last slide).

Slide 30: Co-training
- Words and concepts are two independent "views".
- Each view is a teacher for the other [Blum & Mitchell '98].

Slide 31: Co-training
- Train initial classifiers in the word space and the concept space.
- Loop: classify documents with the current classifiers; retrain with the highly confident predictions of both classifiers.

Slide 32: Using Unlabeled Data
- Three approaches: bootstrapping with labels using bag of words, bootstrapping with labels using ESA, and co-training.

Slide 33: More Results
- Co-training using just the labels does as well as supervision with 100 examples.
- [Bar chart: accuracy (%) on Newsgroups and Yahoo! for Supervised (10 examples), Supervised (100 examples), Bootstrap (Words), Bootstrap (Concepts), and Co-Train; the last three use no annotated data.]

Slide 34: Outline (repeated; next: Robustness to different domains)

Slide 35: Domain Adaptation
- Classifiers trained on one domain and tested on another.
- Performance usually decreases across domains.

Slide 36: But the label names are the same
- Label names don't depend on the domain; they are robust across domains.
- On-the-fly classifiers are domain independent.

Slide 37: Example: Baseball vs. Hockey
- [Bar chart: accuracy (%) when testing on Newsgroups and on Yahoo!, comparing classifiers trained on Newsgroups, trained on Yahoo!, and the dataless classifier.]

Slide 38: Conclusion
- Sometimes, label names tell us more about a class than annotated examples.
- The standard learning practice of treating labels as unique identifiers loses information.
- The right semantic representation helps. What is the right one?
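As a closing illustration of the bootstrapping loop on Slide 29, here is a minimal sketch that seeds a nearest-centroid classifier with the label names and then repeatedly classifies the unlabeled pool, keeping only the most confident predictions for retraining. The toy corpus, the stopword list, the number of rounds, and the per-round selection size are assumptions for illustration, not the paper's configuration; the co-training variant of Slides 30-31 would keep one such classifier per view (words and concepts) and let each supply confident examples to the other.

```python
# A minimal sketch of label-seeded bootstrapping (Slide 29), assuming a
# nearest-centroid classifier over bag-of-words vectors. The stopword list,
# toy corpus, and loop parameters are illustrative, not the paper's setup.
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "on", "to", "and", "with", "he", "up", "past"}

def bow(text):
    """Bag-of-words vector with a tiny stopword filter."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(c * v.get(w, 0) for w, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def bootstrap(label_names, unlabeled_docs, rounds=3, per_round=2):
    # Each label name is the single "labeled" document of its class, so the
    # initial classifier is exactly the on-the-fly classifier.
    centroids = {name: bow(name) for name in label_names}
    pool = list(unlabeled_docs)
    for _ in range(rounds):
        # Classify every pooled document with the current classifier.
        scored = []
        for doc in pool:
            vec = bow(doc)
            best = max(centroids, key=lambda name: cosine(vec, centroids[name]))
            scored.append((cosine(vec, centroids[best]), doc, best))
        # Retrain on the most confident predictions only.
        scored.sort(reverse=True)
        for _, doc, label in scored[:per_round]:
            centroids[label].update(bow(doc))
            pool.remove(doc)
    return centroids

docs = ["the goalie blocked the puck on the ice",
        "the pitcher threw a fastball past the batter",
        "he skated up the ice with the puck",
        "the batter hit a home run"]
centroids = bootstrap(["hockey ice skating", "baseball pitcher batting"], docs)
test = "the shortstop caught the ball and threw to first base"
print(max(centroids, key=lambda name: cosine(bow(test), centroids[name])))
# -> "baseball pitcher batting"
```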