TagHelper: User’s Manual
Carolyn Penstein Rosé (cprose@cs.cmu.edu), Carnegie Mellon University
Funded through the Pittsburgh Science of Learning Center and the Office of Naval Research, Cognitive and Neural Sciences Division
Licensed under the GNU General Public License
Copyright 2007, Carolyn Penstein Rosé, Carnegie Mellon University

Setting Up Your Data

Creating a Trained Model

Training and Testing
- Start TagHelper tools by double-clicking the portal.bat icon in your TagHelperTools2 folder. You will then see the tool palette.
- The idea is that you train a prediction model on your coded data and then apply that model to uncoded data.
- Click on Train New Models.

Loading a File
- First click on Add a File, then select a file.

Simplest Usage
- Click “GO!”
- TagHelper will use its default settings to train a model on your coded examples, then use that model to assign codes to the uncoded examples.

More Advanced Usage
- The second option is to modify the default settings. You get to the options you can set by clicking on >> Options.
- After you finish, click “GO!”

Options
- This is where you set the options. They are discussed in more detail below.

Output
- You can find the output in the OUTPUT folder.
- There will be a text file named Eval_[name of coding dimension]_[name of input file].txt. This is a performance report, e.g., Eval_Code_SimpleExample.xls.txt.
- There will also be a file named [name of input file]_OUTPUT.xls. This is the coded output, e.g., SimpleExample_OUTPUT.xls.

Using the Output File Prefix
- If you use the Output file prefix, the text you enter will be prepended to the output files.
- The performance report will be named [prefix]_Eval_[name of coding dimension]_[name of input file].txt, e.g., Prefix1_Eval_Code_SimpleExample.xls.txt.
- The coded output will be named [prefix]_[name of input file]_OUTPUT.xls, e.g., Prefix1_SimpleExample_OUTPUT.xls.

Evaluating Performance

Performance Report
The performance report tells you:
- What dataset was used
- What the customization settings were
- At the bottom of the file, reliability statistics and a confusion matrix that shows which types of errors are being made

Output File
The output file contains the codes for each segment:
- Segments that were already coded retain their original code.
- The other segments receive automatic predictions.
- The prediction column indicates the confidence of each prediction.

Using a Trained Model

Applying a Trained Model
- Select a model file, then select a testing file.
- Testing data should be set up with ? on uncoded examples.
- Click “GO!” to process the file.

Results

Overview of Basic Feature Extraction from Text

Customizations
To customize the settings:
- Select the file.
- Click on Options.

Setting the Language
- You can change the default language from English to German.
- Chinese requires an additional license from Academia Sinica in Taiwan.

Preparing to Get a Performance Report
- You can decide whether you want TagHelper to prepare a performance report for you. (It runs faster when this is disabled.)
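TagHelper itself is a point-and-click tool (its classifiers, such as SMO and J48 below, come from the Weka machine learning toolkit), so none of the steps above require programming. Purely to illustrate the train-then-apply pattern that “GO!” automates, here is a minimal, hypothetical Python sketch using scikit-learn; the file name, column names, and CSV format are illustrative assumptions (TagHelper actually reads Excel .xls files), not TagHelper’s own API.

    # Illustrative sketch of the train-then-apply workflow (not TagHelper's code).
    # Assumes a table with a "Code" column ("?" marks uncoded rows) and a "Text"
    # column; CSV is used here only to keep the sketch dependency-light.
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    data = pd.read_csv("SimpleExample.csv")      # hypothetical input file
    coded = data[data["Code"] != "?"]            # rows a human already coded
    uncoded = data[data["Code"] == "?"]          # rows to be coded automatically

    vectorizer = CountVectorizer()               # "bag of words" features (see below)
    model = LinearSVC()                          # a linear SVM, in the spirit of Weka's SMO
    model.fit(vectorizer.fit_transform(coded["Text"]), coded["Code"])

    # Assign predicted codes to the uncoded rows and write the output file.
    data.loc[uncoded.index, "Code"] = model.predict(
        vectorizer.transform(uncoded["Text"]))
    data.to_csv("SimpleExample_OUTPUT.csv", index=False)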
Classifier Options
Rules of thumb:
- SMO is state-of-the-art for text classification.
- J48 is best with small feature sets, and it also handles contingencies between features well.
- Naïve Bayes works well for models where decisions are made by accumulating evidence rather than by hard-and-fast rules.

The “Bag of Words” Representation
Represent each text as a vector in which each position corresponds to a term:

                       Cheese  Cows  Eggs  Hens  Lay  Make
    “Cows make cheese”    1      1     0     0    0     1
    “Hens lay eggs”       0      0     1     1    1     0

What can’t you conclude from “bag of words” representations?
- Causality: “X caused Y” versus “Y caused X”
- Roles and mood: “Which person ate the food that I prepared this morning and drives the big car in front of my cat” versus “The person, which prepared food that my cat and I ate this morning, drives in front of the big car.” Who’s driving, who’s eating, and who’s preparing food?

X’ Structure
[Diagram: an X-bar tree for the phrase “the black cat in the hat”. The complete phrase, X’’, is sometimes called “a maximal projection”; it dominates a specifier (“the”) and an X’ level containing the pre-head modifier (“black”), the head (“cat”), and the post-head modifier (“in the hat”).]

Basic Anatomy: Layers of Linguistic Analysis
- Phonology: the sound structure of language (basic sounds, syllables, rhythm, intonation)
- Morphology: the building blocks of words; inflection (tense, number, gender) and derivation (building words from other words, transforming part of speech)
- Syntax: structural and functional relationships between spans of text within a sentence (phrase and clause structure)
- Semantics: literal meaning, propositional content
- Pragmatics: non-literal meaning, language use, language as action, social aspects of language (tone, politeness)
- Discourse analysis: language in practice; relationships between sentences, interaction structures, discourse markers, anaphora and ellipsis

Part of Speech Tagging
The Penn Treebank tagset (http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html):
1. CC    Coordinating conjunction
2. CD    Cardinal number
3. DT    Determiner
4. EX    Existential there
5. FW    Foreign word
6. IN    Preposition or subordinating conjunction
7. JJ    Adjective
8. JJR   Adjective, comparative
9. JJS   Adjective, superlative
10. LS   List item marker
11. MD   Modal
12. NN   Noun, singular or mass
13. NNS  Noun, plural
14. NNP  Proper noun, singular
15. NNPS Proper noun, plural
16. PDT  Predeterminer
17. POS  Possessive ending
18. PRP  Personal pronoun
19. PRP$ Possessive pronoun
20. RB   Adverb
21. RBR  Adverb, comparative
22. RBS  Adverb, superlative
23. RP   Particle
24. SYM  Symbol
25. TO   to
26. UH   Interjection
27. VB   Verb, base form
28. VBD  Verb, past tense
29. VBG  Verb, gerund/present participle
30. VBN  Verb, past participle
31. VBP  Verb, non-3rd person singular present
32. VBZ  Verb, 3rd person singular present
33. WDT  wh-determiner
34. WP   wh-pronoun
35. WP$  Possessive wh-pronoun
36. WRB  wh-adverb
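To see these tags in practice, here is a short sketch using the NLTK library; NLTK is an assumption chosen for illustration, since TagHelper computes its POS features internally with its own tagger.

    # Sketch: tagging a sentence with Penn Treebank POS tags via NLTK.
    # Requires: pip install nltk, plus one-time model downloads below
    # (newer NLTK versions may name the tagger resource
    # "averaged_perceptron_tagger_eng").
    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("Which answer is the common denominator?")
    print(nltk.pos_tag(tokens))
    # Expected (may vary slightly across NLTK versions):
    # [('Which', 'WDT'), ('answer', 'NN'), ('is', 'VBZ'), ('the', 'DT'),
    #  ('common', 'JJ'), ('denominator', 'NN'), ('?', '.')]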
TagHelper Customizations: Feature Space Design
Think like a computer!
- Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful.
- Look for approximations. If you want to find questions, you don’t need a complete syntactic analysis: look for question marks, and look for wh-terms that occur immediately before an auxiliary verb.
- Punctuation can be a “stand in” for mood: “you think the answer is 9?” versus “you think the answer is 9.”
- Bigrams capture simple lexical patterns: “common denominator” versus “common multiple”.
- POS bigrams capture syntactic or stylistic information: “the answer which is …” versus “which is the answer”.
- Line length can be a proxy for explanation depth.
- Contains non-stopword can be a predictor of whether a conversational contribution is contentful: “ok sure” versus “the common denominator”.
- Remove stopwords removes some distracting features.
- Stemming allows some generalization: multiple, multiply, multiplication.
- Removing rare features is a cheap form of feature selection. Features that occur only once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space.

Group Activity
Use TagHelper features to make up rules that identify thematic roles in these sentences:
- Agent: who is doing the action
- Theme: what the action is done to
- Recipient: who benefits from the action
- Source: where the theme started
- Destination: where the theme ended up
- Tool: what the agent used to do the action to the theme
- Manner: how the agent behaved while doing the action

1. The man chased the intruder.
2. The intruder was chased by the man.
3. Aaron carefully wrote a letter to Marilyn.
4. Marilyn received the letter.
5. John moved the package from the table to the sofa.
6. The governor entertained the guests in the parlor.

New Feature Creation
Why create new features?
- You may want to generalize across sets of related words: Color = {red, yellow, orange, green, blue}; Food = {cake, pizza, hamburger, steak, bread}.
- You may want to detect contingencies: the text must mention both cake and presents in order to count as a birthday party.
- You may want to combine these: the text must include a Color and a Food.

Why create new features by hand?
- Rules: for simple rules, it might be easier and faster to write the rules by hand than to learn them from examples.
- Features: hand-built features are more likely to capture meaningful generalizations, and they build in knowledge so you can get by with less training data.

Rule Language
- ANY() is used to create lists: COLOR = ANY(red,yellow,green,blue,purple); FOOD = ANY(cake,pizza,hamburger,steak,bread)
- ALL() is used to capture contingencies: ALL(cake,presents)
- More complex rules combine these: ALL(COLOR,FOOD)

Group Project
Make a rule that will match the questions but not the statements (a sketch of how such a rule could be evaluated appears after the examples):
- Question: Tell me what your favorite color is. / Statement: I tell you my favorite color is blue.
- Question: Where do you live? / Statement: I live where my family lives.
- Question: Which kinds of baked goods do you prefer? / Statement: I prefer to eat wheat bread.
- Question: Which courses should I take? / Statement: You should take my applied machine learning course.
- Question: Tell me when you get up in the morning. / Statement: I get up early.

Possible rule: ANY(ALL(tell,me),BOL_WDT,BOL_WRB)
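As a plain-Python sketch of what evaluating that possible rule could look like: this mimics, but does not reproduce, TagHelper’s rule language, and it reads BOL_WDT / BOL_WRB as “the line begins with a WDT / WRB tag”, which is an assumption about the feature naming convention. It reuses the NLTK resources installed in the earlier sketch.

    # Sketch: evaluating ANY(ALL(tell,me), BOL_WDT, BOL_WRB) in plain Python.
    # Assumes the NLTK tokenizer and tagger models from the earlier sketch.
    import nltk

    def matches_question_rule(line):
        tokens = [t.lower() for t in nltk.word_tokenize(line)]
        first_tag = nltk.pos_tag(nltk.word_tokenize(line))[0][1]
        all_tell_me = "tell" in tokens and "me" in tokens   # ALL(tell,me)
        bol_wh = first_tag in ("WDT", "WRB")                # BOL_WDT, BOL_WRB
        return all_tell_me or bol_wh                        # ANY(...)

    print(matches_question_rule("Which courses should I take?"))  # True
    print(matches_question_rule("I get up early."))                # False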
Advanced Feature Editing
- Click on Adv Feature Editing.
- For small datasets, first deselect Remove rare features.

Types of Basic Features
- Primitive features include unigrams, bigrams, and POS bigrams.
- The options change which primitive features show up in the Unigram, Bigram, and POS bigram lists:
  - You can choose whether or not to remove stopwords.
  - You can choose whether or not to strip endings off words with stemming.
  - You can choose how frequently a feature must appear in your data in order for it to show up in your lists.
- Now let’s look at how to create new features.

Creating New Features
- The feature editor allows you to create new feature definitions.
- Click on + to add your new feature.

Examining a New Feature
- Right-click on a feature to examine where it matches in your data.

Adding New Features by Script
- Modify the ex_features.txt file. This allows you to save your definitions and makes it easier to cut and paste.

Error Analysis

Create an Error Analysis File
- Use TagHelper to code an uncoded file. The output file contains the codes TagHelper assigned.
- Now remove the prediction column and insert the correct answers next to the answers TagHelper assigned.
- Load the error analysis file.

Error Analysis Strategies
- Look for large error cells in the confusion matrix.
- Locate the examples that correspond to that cell.
- What features do those examples share? How are they different from the examples that were classified correctly?

Group Project
- From NewsGroupTopic.xls, create NewsGroupTrain.xls, NewsGroupTest.xls, and NewsGroupAnswers.xls.
- Load in the NewsGroupTrain.xls dataset. What is the best performance you can get by playing with the standard TagHelper feature options?
- Train a model using the best settings, then use it to assign codes to NewsGroupTest.xls.
- Copy in the Answer column from NewsGroupAnswers.xls.
- Now do an error analysis to determine why frequent mistakes are being made. How could you do better?

Feature Selection
Why do irrelevant features hurt performance? They might confuse a classifier, and they waste time. There are two solutions: use a feature selection algorithm, or only extract a subset of possible features.
- Click on the AttributeSelectedClassifier.
- Feature selection algorithms pick out a subset of the features that work best. Usually they evaluate each feature in isolation.
- First click here, then pick your base classifier just like before. Finally, configure the feature selection.

Setting Up Feature Selection
- The number of features you pick should not be larger than the number of features available.
- The number should also not be larger than the number of coded examples you have.

Examining Which Features Are Most Predictive
If you use feature selection, you can find a ranked list of features in the performance report, with a predictiveness score and a frequency for each feature.
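As a rough illustration of ranking features by predictiveness, in the spirit of the AttributeSelectedClassifier: Weka offers several attribute evaluators, and the chi-squared scoring used in this scikit-learn sketch is just one common choice, not necessarily the one TagHelper uses. The toy texts and labels are invented for the example.

    # Sketch: rank bag-of-words features by a predictiveness score (chi-squared).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    texts = ["where do you live", "which courses should i take",
             "i live where my family lives", "you should take my course"]
    labels = ["question", "question", "statement", "statement"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)

    selector = SelectKBest(chi2, k=3).fit(X, labels)  # keep the 3 best features
    ranked = sorted(zip(selector.scores_, vectorizer.get_feature_names_out()),
                    reverse=True)
    for score, feature in ranked[:3]:
        print(f"{feature}: {score:.2f}")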
Optimization
Key idea: combine multiple views on the same data in order to increase reliability.

Boosting
- In boosting, a series of models is trained, and each trained model is influenced by the strengths and weaknesses of the previous model.
- New models should be experts at classifying the examples that the previous model got wrong.
- Boosting specifically seeks to train multiple models that complement each other.
- In the final vote, model predictions are weighted based on each model’s performance.

More About Boosting
- The more iterations, the more confident the trained classifier will be in its predictions. But higher confidence doesn’t necessarily mean higher accuracy! When a classifier becomes overly confident, it is said to “overfit”.
- Boosting can turn a weak classifier into a strong classifier: a simple classifier can learn a complex rule.

Boosting in TagHelper
- Boosting is an option listed in the Meta folder, near the AttributeSelectedClassifier. It is listed as AdaBoostM1. Go ahead and click on it now.
- To set up boosting, select a classifier and set the number of cycles of boosting.

Semi-Supervised Learning

Using Unlabeled Data
If you have a small amount of labeled data and a large amount of unlabeled data, you can use a type of bootstrapping to learn a model that exploits regularities in the larger set of data:
- The stable regularities might be easier to spot in the larger set than in the smaller set.
- You are less likely to overfit your labeled data.

Semi-Supervised Learning: The Basic Idea
- Train on a small amount of data.
- Add the positive and negative examples you are most confident about to the training data.
- Retrain.
- Keep looping until you have labeled all the data.

Semi-Supervised Learning in TagHelper Tools
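TagHelper’s semi-supervised mode is configured through the GUI. As a minimal sketch of the generic self-training loop described above, and not TagHelper’s exact algorithm, the following assumes dense numeric feature arrays; the base classifier and batch size are illustrative choices.

    # Sketch of generic self-training (not TagHelper's exact algorithm).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_train(X_labeled, y_labeled, X_unlabeled, batch=2):
        """Iteratively move the most confidently predicted unlabeled
        examples into the training set until none remain.
        Expects dense numpy arrays for X_labeled and X_unlabeled."""
        X_train, y_train = X_labeled.copy(), list(y_labeled)
        pool = list(range(len(X_unlabeled)))     # indices still unlabeled
        model = LogisticRegression()
        while pool:
            model.fit(X_train, y_train)
            probs = model.predict_proba(X_unlabeled[pool])
            conf = probs.max(axis=1)
            # Take the `batch` most confident predictions from the pool.
            best = np.argsort(conf)[::-1][:batch]
            for i in best:
                idx = pool[i]
                X_train = np.vstack([X_train, X_unlabeled[idx:idx + 1]])
                y_train.append(model.classes_[probs[i].argmax()])
            pool = [p for j, p in enumerate(pool) if j not in set(best)]
        return model

In TagHelper the analogous loop operates over your coded rows and the rows marked with ?.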