Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
TagHelper Tools and SIDE
Define Summaries
Annotate Data
TagHelper Tools uses text mining technology to automate annotation of conversational data
Visualize Annotated Data
SIDE facilitates rapid prototyping of reporting interfaces for group learning facilitators
How do you know when you have coded enough data?
What distinguishes Questions and Statements?
Just looking for a question mark won’t work
Start TagHelper Tools by double-clicking on the portal.bat icon in your TagHelperTools2 folder
You will then see the following tool palette
The idea is that you will train a prediction model on your coded data and then apply that model to uncoded data
Click on Train New Models
First click on Add a File
Then select a file
Click “GO!”
TagHelper will use its default settings to train a model on your coded examples
It will use that model to assign codes to the uncoded examples
The second option is to modify the default settings
You get to the options you can set by clicking on >> Options
After you finish that, click “GO!”
The performance report tells you:
What dataset was used
What the customization settings were
At the bottom of the file are reliability statistics and a confusion matrix that tells you which types of errors are being made
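The report’s statistics can be reproduced outside TagHelper; for instance, this scikit-learn sketch computes Cohen’s kappa and a confusion matrix for a pair of invented code sequences:

```python
# Sketch of the reliability statistics and confusion matrix from the
# performance report, computed with scikit-learn; all labels invented.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

human = ["question", "statement", "question", "statement", "question"]
model = ["question", "statement", "statement", "statement", "question"]

print(cohen_kappa_score(human, model))  # agreement beyond chance
print(confusion_matrix(human, model, labels=["question", "statement"]))
# rows = human codes, columns = predicted codes; off-diagonal cells
# reveal which types of errors are being made
```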
The output file contains
The codes for each segment
Note that the segments that were already coded will retain their original code
The other segments will have their automatic predictions
The prediction column indicates the confidence of the prediction
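TagHelper drives this workflow from its GUI; purely as an illustration, here is a minimal scikit-learn sketch (not TagHelper’s own Weka-based code) of the same train-then-apply loop, with invented texts and codes:

```python
# A minimal sketch of the train-then-apply workflow in scikit-learn
# (not TagHelper's own code); all example texts and codes are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

coded = [("how do we add these fractions?", "question"),
         ("you have to find the common denominator", "statement"),
         ("is nine the answer?", "question"),
         ("the answer is nine", "statement")]
uncoded = ["what do we do next?", "we multiply the denominators"]

texts, codes = zip(*coded)
vectorizer = CountVectorizer()            # bag-of-words features
X_train = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X_train, codes)

X_new = vectorizer.transform(uncoded)
for text, code, probs in zip(uncoded, model.predict(X_new),
                             model.predict_proba(X_new)):
    # analogous to TagHelper's prediction column: code plus confidence
    print(f"{text!r} -> {code} (confidence {probs.max():.2f})")
```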
To customize the settings:
Select the file
Click on Options
Rules of thumb:
SMO is state-of-the-art for text classification
It works with the three main types of feature sets and also handles contingencies between features well
Its predictions are made based on accumulating evidence rather than hard and fast rules
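TagHelper’s SMO learner comes from Weka; to test rules of thumb like these on your own corpus, you can cross-validate rough scikit-learn analogues. In this sketch LinearSVC stands in for an SMO-style SVM, with a decision tree and Naïve Bayes added for comparison; all toy data is invented:

```python
# Rough scikit-learn analogues for comparing learners on your own data:
# LinearSVC stands in for an SMO-style SVM. Toy data is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

texts = ["is nine the answer?", "the answer is nine",
         "what is a common denominator?", "multiply the denominators",
         "how do we simplify this?", "we cancel the common factor"]
codes = ["question", "statement"] * 3

X = CountVectorizer().fit_transform(texts)
for name, clf in [("SMO-like SVM", LinearSVC()),
                  ("decision tree", DecisionTreeClassifier()),
                  ("Naive Bayes", MultinomialNB())]:
    scores = cross_val_score(clf, X, codes, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```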
Basic Idea
Represent text as a vector where each position corresponds to a term
This is called the “bag of words” approach
Cheese  Cows  Eggs  Hens  Lay  Make
“Cows make cheese” → 1 1 0 0 0 1
“Hens lay eggs” → 0 0 1 1 1 0
But “Cheese makes cows.” gets exactly the same representation!
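A minimal sketch of this representation with scikit-learn’s CountVectorizer (TagHelper builds its vectors internally); a crude replace() stands in for stemming so that “makes” matches the vocabulary entry “make”:

```python
# Bag-of-words sketch: each vector position is a term, and word order
# is discarded, so the reordered sentence comes out identical.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(vocabulary=["cheese", "cows", "eggs",
                                         "hens", "lay", "make"])
docs = ["Cows make cheese", "Hens lay eggs", "Cheese makes cows"]
# crude stand-in for the stemming step, just for this demo
stemmed = [d.lower().replace("makes", "make") for d in docs]
for doc, row in zip(docs, vectorizer.transform(stemmed).toarray()):
    print(f"{doc:20s} -> {row}")
# Cows make cheese     -> [1 1 0 0 0 1]
# Hens lay eggs        -> [0 0 1 1 1 0]
# Cheese makes cows    -> [1 1 0 0 0 1]   (identical to the first!)
```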
What can’t you conclude from “bag of words” representations?
Causality: “X caused Y” versus “Y caused X”
Roles and Mood: “Which person ate the food that I prepared this morning and drives the big car in front of my cat” versus “The person, which prepared food that my cat and I ate this morning, drives in front of the big car.”
Who’s driving, who’s eating, and who’s preparing food?
Basic Anatomy: Layers of Linguistic Analysis
Phonology : The sound structure of language
Basic sounds, syllables, rhythm, intonation
Morphology : The building blocks of words
Inflection: tense, number, gender
Derivation: building words from other words, transforming part of speech
Syntax : Structural and functional relationships between spans of text within a sentence
Phrase and clause structure
Semantics : Literal meaning, propositional content
Pragmatics : Non-literal meaning, language use, language as action, social aspects of language (tone, politeness)
Discourse Analysis : Language in practice, relationships between sentences, interaction structures, discourse markers, anaphora and ellipsis
http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition/subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund/present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd ps. sing. present
32. VBZ Verb, 3rd ps. sing. present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb
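These are the tags most Penn Treebank-trained taggers emit; for example, NLTK’s default tagger (assuming the ‘punkt’ and ‘averaged_perceptron_tagger’ data packages have been downloaded):

```python
# NLTK's default tagger emits the Penn Treebank tags listed above.
import nltk

tokens = nltk.word_tokenize("Which person ate the food?")
print(nltk.pos_tag(tokens))
# e.g. [('Which', 'WDT'), ('person', 'NN'), ('ate', 'VBD'),
#       ('the', 'DT'), ('food', 'NN'), ('?', '.')]
```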
TagHelper Customizations
Feature Space Design
Think like a computer!
Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful
Look for approximations
If you want to find questions, you don’t need to do a complete syntactic analysis
Look for question marks
Look for wh-terms that occur immediately before an auxiliary verb
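A minimal sketch of that approximation in Python; the wh-word and auxiliary lists below are illustrative, not exhaustive:

```python
# Heuristic question detector along the lines described above: a question
# mark, or a wh-word immediately before an auxiliary verb.
WH_WORDS = {"who", "what", "when", "where", "why", "which", "how"}
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did",
               "can", "could", "will", "would", "should", "has", "have"}

def looks_like_question(text: str) -> bool:
    if text.rstrip().endswith("?"):
        return True
    tokens = text.lower().split()
    return any(a in WH_WORDS and b in AUXILIARIES
               for a, b in zip(tokens, tokens[1:]))

print(looks_like_question("what do we do next"))          # True
print(looks_like_question("you think the answer is 9."))  # False
```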
TagHelper Customizations
Feature Space Design
Punctuation can be a “stand in” for mood
“you think the answer is 9?”
“you think the answer is 9.”
Bigrams capture simple lexical patterns
“common denominator” versus “common multiple”
POS bigrams capture syntactic or stylistic information
“the answer which is …” vs “which is the answer”
Line length can be a proxy for explanation depth
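An NLTK sketch of what these feature types look like for one chat line (TagHelper computes them internally; NLTK’s tokenizer and tagger data must be downloaded first):

```python
# Word bigrams, POS bigrams, and line length for a single contribution.
import nltk

line = "you think the answer is 9?"
tokens = nltk.word_tokenize(line)
tags = [tag for _, tag in nltk.pos_tag(tokens)]

word_bigrams = list(nltk.bigrams(tokens))  # lexical patterns, e.g. ('the', 'answer')
pos_bigrams = list(nltk.bigrams(tags))     # syntactic/stylistic patterns
line_length = len(tokens)                  # proxy for explanation depth
print(word_bigrams, pos_bigrams, line_length)
```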
TagHelper Customizations
Feature Space Design
“Contains non-stop word” can be a predictor of whether a conversational contribution is contentful
“ok sure” versus “the common denominator”
“Remove stop words” removes some distracting features
Stemming allows some generalization
Multiple, multiply, multiplication
Removing rare features is a cheap form of feature selection
Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
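These options have direct counterparts in scikit-learn and NLTK, as the following sketch with invented example documents shows:

```python
# Counterparts of the options above: stop-word removal and rare-feature
# removal via CountVectorizer, stemming via NLTK's Porter stemmer.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["multiple", "multiply", "multiplication"]:
    print(word, "->", stemmer.stem(word))  # endings stripped for generalization

vectorizer = CountVectorizer(
    stop_words="english",  # "remove stop words"
    min_df=2,              # "remove rare features": keep terms in >= 2 docs
)
docs = ["ok sure", "the common denominator", "a common multiple"]
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())  # only 'common' survives both filters
```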
Why create new features by hand?
Rules
For simple rules, it might be easier and faster to write the rules by hand instead of learning them from examples
Features
More likely to capture meaningful generalizations
Build in knowledge so you can get by with less training data
ANY() is used to create lists
COLOR = ANY(red,yellow,green,blue,purple)
FOOD = ANY(cake,pizza,hamburger,steak,bread)
ALL() is used to capture contingencies
ALL(cake,presents)
More complex rules
ALL(COLOR,FOOD)
* Note that you may wish to use part-of-speech tags in your rules!
What can you do with this rule language?
You may want to generalize across sets of related words
Color = {red,yellow,orange,green,blue}
Food = {cake,pizza,hamburger,steak,bread}
You may want to detect contingencies
The text must mention both cake and presents in order to count as a birthday party
You may want to combine these
The text must include a Color and a Food
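As a sketch of the semantics (not TagHelper’s implementation), ANY() and ALL() rules like these can be evaluated over a set of tokens:

```python
# Sketch of ANY()/ALL() rule evaluation over a token set.
COLOR = {"red", "yellow", "green", "blue", "purple"}
FOOD = {"cake", "pizza", "hamburger", "steak", "bread"}

def any_of(words, tokens):
    return bool(words & tokens)  # ANY(): at least one member present

def all_of(parts, tokens):
    # ALL(): every part must match; parts may be words or nested rules
    return all(p(tokens) if callable(p) else p in tokens for p in parts)

tokens = set("we ate cake and opened presents".split())
print(all_of(["cake", "presents"], tokens))          # ALL(cake,presents) -> True

tokens = set("she had pizza with a red drink".split())
print(all_of([lambda t: any_of(COLOR, t),
              lambda t: any_of(FOOD, t)], tokens))   # ALL(COLOR,FOOD) -> True
```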
* For small datasets, first deselect Remove rare features.
* Next, click on Adv Feature Editing
* Now you may begin creating your own features.
Primitive features include unigrams, bigrams, and POS bigrams
The Options change which primitive features show up in the Unigram, Bigram, and POS bigram lists
You can choose to remove stopwords or not
You can choose whether or not to strip endings off words with stemming
You can choose how frequently a feature must appear in your data in order for it to show up in your lists
* Now let’s look at how to create new features.
* You can use the feature editor to create new features.
* First click on ANY
* Then click ALL
* Now fill in ‘tell’ and ‘me’
* Now fill in the rest of the pattern from the POS Bigram list
* Now change the name
* Click to add to feature list