When Bad is Good: Effect of Ebonics on Computational Language Processing
Tamsin Maxwell
5 November 2007
Acknowledgment
• IP notice - some background slides thanks to:
– Dan Jurafsky, Jim Martin, Bonnie Dorr
(Linguist180, 2007)
– Mark Wasson, LexisNexis (2002)
– Theresa Wilson (PhD work 2006)
– Janyce Wiebe, U. Pittsburgh (2006)
Overview
Part 1
• Comp Ling and ‘free text’
• Ebonics in lyrics
• Comp Ling approach to language
• Example - part of speech (POS) tagging
• Application - sentiment analysis
Part 2
• Methodology
• Results
• The broader picture
Comp Ling and ‘free text’
• Just like people, Comp Ling relies on having lots of
examples to learn from
• Most advances based on large text collections
– Brown corpus: 1 million words, 15 genres (1961)
– GigaWord corpus (over a billion words) … Google's trillion-word web corpus
• Training data is generally ‘clean’ but real language is
messy
• Can we successfully use techniques developed on
clean data to understand ‘free text’?
The trouble with text
• We’re dealing with language!
– Text/speech may lack structure that traditional
processing and mining techniques can exploit
– Information within text may be implicit
– Ambiguity
– Disfluency
– Error, sarcasm, invention, creativity….
• Contrast with spreadsheets, databases, etc.
– Well-defined structure
Back to basics
• “Not enough focus on the data”
– Collection
– Cleansing
– Scale
– Completeness, including non-traditional sources
– Structure
• “Too much focus on algorithms”
– Mark Wasson (LexisNexis)
What is ebonics?
• Ebonics, or AAVE (African American Vernacular
English), differs from ‘regular’ English
– Pronunciation: “tief” - steal / “dem” - them or those
– Grammar: “he come coming in here” - he comes in here
– Vocabulary: “kitchen” - kinky hair at nape of neck
– Slang: “yard axe” - preacher of little ability
• Many ebonics words, phrases, pronunciations have
become common parlance
– ain't, gimme, bro, lovin, wanna
Green, Lisa J. (2002). “African American English: A Linguistic Introduction”,
Cambridge University Press
Source of ebonics
• Rooted in African tradition but very much American
– Southern rural during slavery
– Slang 1900-1960: “sinner-man” - black musician
– Street culture, rap and hip-hop
– Working class
Lyrics data
• 2392 popular song lyrics from 2002
• 496 artists, mostly pop/rock and some rap/hip-hop
• From user-upload public websites, so very messy!
– Misspelling (e.g. dancning)
– Phonetic spelling (e.g. cit-aa-aa-aa-aaay)
– Annotation (e.g. feat xxx, [Dre])
– Abbreviation (e.g. chorus x2)
– Too much punctuation or none (e.g. no line breaks)
– Foreign language
Ebonics lexicon
Approximately 15% of all tokens
Ebonics
• Phonetic translations (pronunciation)
• Slang
• Abbreviations
Also includes
• Spelling errors
• Unintended word conjunctions
• Named entities (people, places, brands)
Lexicon examples
Ebonics frequency
[Figure: token frequency by dictionary word number - ebonics terms in red, ‘regular’ English in blue]
Ebonics in lyrics
• Red in curve suggests many frequent and
differentiating words are ebonics terms
• These terms have largely replaced their regular
English counterparts
– Cos, coz, cuz and cause for because
– Wanna for want to
• Red streaks in the tail suggest certain artists have very
high proportions of ‘ebonics’
– Others may use only very common terms
Comp Ling ABC
• Computer programmes have three basic tasks:
– Check conditions (if x=y, then…)
– Perform tasks in order (procedure)
– Repeat tasks (iterate)
• This underlies Comp Ling analysis, e.g.
– Break text into chunks (words, clauses, sentences)
– Check if chunks meet specified conditions
– Represent conditions with numbers (e.g. counts, binary)
– Repeat for different conditions
– Use maths to find the most likely solution to all conditions (a toy sketch follows below)
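As a minimal illustration of these basic steps (a toy Python sketch, not part of the original methodology), the function below breaks lyric lines into words, checks a condition, represents the results as counts and picks the most likely answer:

    from collections import Counter

    def most_common_line_ending(lines):
        """Toy example: which word most often ends a line of lyrics?"""
        counts = Counter()
        for line in lines:                        # repeat the task for every line
            tokens = line.split()                 # break text into chunks (words)
            if tokens:                            # check a condition
                counts[tokens[-1].lower()] += 1   # represent the result as a number
        return counts.most_common(1)              # use maths: pick the most frequent

    print(most_common_line_ending(["I want you back", "Bring the beat back", "Turn it up"]))
    # [('back', 2)]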
An example: POS tagging
An example
• Break text into chunks (words, clauses, sentences)
• Check if chunks meet specified conditions
• Represent results with numbers (e.g. counts, binary)
• Repeat for different conditions
• Use maths to find the most likely solution to all conditions
Tokenise the data
• Each line of lyrics tokenised into unique tokens (words)
• Punctuation split off by default
– Where are you? → Where / are / you / ?
– It's about time → It / 's / about / time
• Very tricky: how to reattach punctuation?
– U.S.A. → U / . / S / . / A / .
– Good lookin' → Good / lookin / '
– Get in' cuz we're ready to go → Get / in / ' / cuz / we / 're / ready / to / go
• “Get in because”?
• “Getting because”?
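A rough sketch of this kind of tokenisation (an illustrative regular expression, not the tokeniser actually used for the lyrics corpus):

    import re

    def tokenise(line):
        """Naive tokeniser: split off punctuation and common clitics."""
        line = re.sub(r"('s|'re)\b", r" \1", line)           # it's -> it 's, we're -> we 're
        return re.findall(r"[A-Za-z0-9]+|'[a-z]*|[^\w\s]", line)

    print(tokenise("Where are you?"))    # ['Where', 'are', 'you', '?']
    print(tokenise("It's about time"))   # ['It', "'s", 'about', 'time']
    print(tokenise("Good lookin'"))      # ['Good', 'lookin', "'"]
    # Reattaching the stranded apostrophe ("lookin" + "'") is where it gets hard.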
Focus the problem
• Convert language into metadata
• In this example, convert tokens into POS tags
• Words often have more than one POS: back
– The back door = JJ
– On my back = NN
– Win the voters back = RB
– Promised to back the bill = VB
• Tags constrained by conditions:
– Condition: what tags are possible?
– Condition: what is the preceding/following word?
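The toy sketch below (illustrative only) encodes exactly these two conditions for “back”: the set of tags the word can take, and a rule on the preceding word:

    TAG_DICT = {"back": {"JJ", "NN", "RB", "VB"}}   # possible tags for "back"

    def guess_back_tag(prev_word):
        """Crude contextual disambiguation based only on the preceding word."""
        prev = prev_word.lower()
        if prev == "to":                            # "promised to back the bill"
            return "VB"
        if prev in {"the", "a"}:                    # "the back door"
            return "JJ"
        if prev in {"my", "your", "his", "her"}:    # "on my back"
            return "NN"
        return "RB"                                 # e.g. "win the voters back"

    for phrase in ["the back door", "on my back", "win the voters back",
                   "promised to back the bill"]:
        words = phrase.split()
        print(phrase, "->", guess_back_tag(words[words.index("back") - 1]))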
Most frequent tag
• Baseline statistical tagger
– Create a dictionary with each possible tag for a word
– Take a tagged corpus
– Count the number of times each tag occurs for that
word
– Pick the most frequent tag independent of
surrounding words (“unigram”)
• Around 90% accuracy on news text
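A minimal version of such a baseline, assuming NLTK and its Brown corpus data are installed (nltk.download('brown')); this is a sketch, not the tagger used in the talk:

    from collections import Counter, defaultdict
    import nltk

    tag_counts = defaultdict(Counter)
    for word, tag in nltk.corpus.brown.tagged_words():
        tag_counts[word.lower()][tag] += 1            # count each tag seen for each word

    def unigram_tag(word):
        tags = tag_counts.get(word.lower())
        if not tags:
            return "NN"                               # unknown words: fall back to noun
        return tags.most_common(1)[0][0]              # most frequent tag, ignoring context

    print(unigram_tag("race"))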
Most frequent tag
• Which POS is more likely for “race” in a corpus of 1,273,000 tokens?
– race tagged NN: 400 times
– race tagged VB: 600 times
– Total: 1000
• P(NN|race) = P(race&NN) / P(race), by the definition of conditional probability
– P(race) ≈ 1000/1,273,000 = .0008
– P(race&NN) ≈ 400/1,273,000 = .0003
– P(race&VB) ≈ 600/1,273,000 = .0005
• And so we obtain:
– P(NN|race) = P(race&NN)/P(race) = .0003/.0008 = .375
– P(VB|race) = P(race&VB)/P(race) = .0005/.0008 = .625
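A quick check of this arithmetic in Python (note that the exact ratios are 400/1000 = .4 and 600/1000 = .6; the .375/.625 above come from rounding the intermediate probabilities):

    N = 1_273_000
    count_nn, count_vb = 400, 600                 # counts of "race" as NN and VB
    p_race = (count_nn + count_vb) / N            # ~ .0008
    print(round((count_nn / N) / p_race, 3))      # P(NN|race) = 0.4
    print(round((count_vb / N) / p_race, 3))      # P(VB|race) = 0.6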
POS tagging as sequence
classification
• We are given a sentence (a “sequence of observations”)
– Secretariat is expected to race tomorrow
• What is the best sequence of tags which corresponds to
this sequence of observations?
• Probabilistic view:
– Consider all possible sequences of tags
– Out of this universe of sequences, choose the tag
sequence which is most probable given the
observations
Disambiguating “race”
Hidden Markov Models
• We want, out of all sequences of n tags t1…tn the single
tag sequence such that P(t1…tn|w1…wn) is highest
• This equation is guaranteed to give us the best tag
sequence
• Hat ^ means “our estimate of the best one”
• Argmaxx f(x) means “the x such that f(x) is maximized”
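The equation itself did not survive the slide export; reconstructed here from the standard HMM tagging formulation (which matches the bigram probabilities used on the next slide):

    \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)
                \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})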
The solution
• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO) P(NR|VB) P(race|VB) = .00000027
• P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
• So we (correctly) choose the verb reading
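The same comparison in a few lines of Python, using the probabilities from the slide for “… to race tomorrow” (TO race/?? tomorrow/NR):

    p_vb = 0.83 * 0.0027 * 0.00012        # P(VB|TO) * P(NR|VB) * P(race|VB) ~ 2.7e-7
    p_nn = 0.00047 * 0.0012 * 0.00057     # P(NN|TO) * P(NR|NN) * P(race|NN) ~ 3.2e-10
    print("VB" if p_vb > p_nn else "NN")  # -> VB: the verb reading wins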
But hang on a minute…
• Most-frequent-tag approach has a problem!
• What about words that don’t appear in the training set?
– E.g. Colgate, goose-pimple, headbanger
• New words are added to (newspaper) language at a rate of 20+ per month
• Plus many proper names …
• “Out of vocabulary” (OOV) words increase tagging
error rates
Handling unknown words
• Method 1: assume they are nouns
• Method 2: assume the unknown words have a
probability distribution similar to words only occurring
once in the training set.
• Method 3: Use morphological information, e.g., words
ending with –ed tend to be tagged VBN.
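A sketch of what such heuristics might look like (illustrative shape-based rules, not the methods evaluated in the talk):

    def guess_oov_tag(word):
        """Guess a tag for an out-of-vocabulary word from its shape (Method 3)."""
        if word[:1].isupper():
            return "NNP"                 # unseen capitalised words are often proper nouns
        if word.endswith("ed"):
            return "VBN"                 # -ed words tend to be participles / past tense
        if word.endswith("ing"):
            return "VBG"
        if word.endswith("s"):
            return "NNS"
        return "NN"                      # Method 1 as the fallback: assume a noun

    print([guess_oov_tag(w) for w in ["Colgate", "headbangers", "goose-pimpled"]])
    # ['NNP', 'NNS', 'VBN']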
The bottom line
• POS tagging is widely considered a solved problem
• Statistical taggers achieve about 97% per-word
accuracy - the same as inter-annotator agreement
Ongoing issues
• Incorrect tags affect subsequent tagging decisions → more likely to find strings of errors
• OOV words (unseen in the training corpus) cause most errors
• Statistical taggers are state-of-the-art but may not be
as portable to new domains as rule-based taggers
What is POS tagging good for?
• First step for vast number of Comp Ling tasks
• Speech synthesis:
– How to pronounce “lead”?
– INsult (noun) vs. inSULT (verb)
– OBject (noun) vs. obJECT (verb)
– OVERflow (noun) vs. overFLOW (verb)
• Word prediction in speech recognition, etc.
– Possessive pronouns (my, your, her) likely to be followed by nouns
– Personal pronouns (I, you, he) likely to be followed by verbs
What is POS tagging good for?
• Parsing: need to know POS in order to parse
[Figure: two alternative parse trees for “The representative put chairs on the table” - one analysis with put/NN and chairs/VBZ, the other with put/VBD and chairs/NNS]
What is POS tagging good for?
• Chunking - dividing text into noun and verb groups
– E.g. rules that permit optional adverbials and adjectives in passive verb groups
• “Was probably sent” or “was sent away”
• Require that the tag for “sent” be VBN or VBD (since POS taggers cannot reliably distinguish the two); a sketch follows below
• Sentiment analysis - deciding when a word reflects emotion
– “I like the way you walk”
– “Beat me like Betty Crocker cake mix”
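A possible encoding of that chunking rule with NLTK's regular-expression chunker (assuming NLTK is installed; the grammar here is illustrative, not the one used in the talk):

    import nltk

    # Passive verb group: auxiliary + optional adverb + participle,
    # accepting VBN or VBD for the participle since taggers confuse the two.
    grammar = r"VG: {<VBD><RB>?<VBN|VBD>}"
    parser = nltk.RegexpParser(grammar)

    tagged = [("was", "VBD"), ("probably", "RB"), ("sent", "VBD")]  # mistagged participle
    print(parser.parse(tagged))   # groups "was probably sent" into a VG chunk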
Sentiment detection
Opinion mining
• Opinions, evaluations, emotions, speculations are
private states
• Find relevant words, phrases, patterns that can be
used to express subjectivity
• Determine the polarity of subjective expressions
• Expressions directed towards an object
Emotion detection
• All text reflects personal emotion
• May be undirected
Polarity
• Focus on positive/negative and strong/weak emotions, evaluations, stances
→ sentiment analysis
I’m ecstatic the Steelers won!
She’s against the bill.
Prior and Contextual Polarity
Lexicon (prior polarity, out of context):
abhor: negative
acrimony: negative
...
cheers: positive
...
beautiful: positive
...
horrid: negative
...
woe: negative
wonderfully: positive
Example in context: “Cheers to Timothy Whitfield for the wonderfully horrid visuals.”
Shifting polarity
• Negation - flip or intensify
– Not good
– Not only good but amazing
• Diminishers - flip
– Little truth
– Lack of understanding
• Word sense - shift
– The old school was condemned in April
– The election was condemned for being rigged
Turney (2002)*
• What happens if we take a simple approach?
• Classified reviews thumbs up/down
• Sum of polarities for all words with affective meaning
in a text
• 74% accuracy (66 - 84% various domains)
[Diagram: two-step approach - Step 1 classifies all instances from the corpus, using the lexicon, as neutral or polar; Step 2 determines the contextual polarity of the polar instances]
*Turney, P.D. (2002). Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics (ACL), pp. 417-424.
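For intuition, a minimal sketch of the word-counting idea in Python. The tiny hand-built lexicon here is purely illustrative; Turney (2002) actually derived word polarities automatically (via association with “excellent” and “poor”) rather than from a manual list:

    LEXICON = {"love": 1, "wonderful": 1, "great": 1,
               "horrid": -1, "boring": -1, "waste": -1}

    def thumbs(review):
        """Sum the prior polarities of the words in a review."""
        score = sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in review.split())
        return "thumbs up" if score > 0 else "thumbs down"

    print(thumbs("A wonderful film, I love it"))    # thumbs up
    print(thumbs("Boring, a waste of two hours"))   # thumbs down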
Sentiment Lexicon
• Developed by Wilson et al (2005)
– First two dimensions of Charles Osgood’s Theory of
Semantic Differentiation
– Evaluation or prior polarity (positive/negative/both/neutral)
– Potency or reliability (strong/weak subjective)
– Does not include activity (passive/active)
• Over 8,000 words
– Both manually and automatically identified
– Positive/negative words from General Inquirer and
Hatzivassiloglou and McKeown (1997)
T. Wilson, J. Wiebe and P.Hoffmann (2005), “Recognizing contextual polarity in phrase-level sentiment
analysis”. HLT '05, p 347--354.
C.E. Osgood, G.J. Suci, P.H. Tannenbaum (1957), “The Measurement of Meaning”, University of Illinois Press
Word entries
• Adjectives1 (7.5% all text)
– Positive: honest, important, mature, large, patient
– Negative: harmful, hypocritical, inefficient, insecure
– Subjective: curious, peculiar, odd, likely, probable
• Verbs2
– positive: praise, love
– negative: blame, criticize
– subjective: predict
• Nouns2
– positive: pleasure, enjoyment
– negative: pain, criticism
– subjective: prediction
• Obscenities
– Five slang words added to avoid obvious error
1. Hatzivassiloglou & McKeown 1997; Wiebe 2000; Kamps & Marx 2002; Andreevskaia & Bergler 2006
2. Turney & Littman 2003; Esuli & Sebastiani 2006
[Diagram: Step 1 (neutral or polar?) applies to all instances from the corpus; Step 2 (contextual polarity?) applies to the polar instances. Features come in five groups: 1. Word features, 2. Modification features, 3. Structure features, 4. Sentence features, 5. Document feature]
Word features (example instance: “terrifies”):
• Word token: terrifies
• Word part-of-speech: VB
• Context: that terrifies me
• Lemma: terrify
• Prior polarity: negative
• Reliability: strongsubj
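A sketch of how such a word-level feature vector might be assembled (illustrative only; not Wilson et al.'s implementation):

    def word_features(token, lemma, pos, sentence, lexicon):
        entry = lexicon.get(lemma, {})
        return {
            "token": token,
            "pos": pos,
            "context": sentence,
            "lemma": lemma,
            "prior_polarity": entry.get("polarity", "neutral"),
            "reliability": entry.get("type", "weaksubj"),
        }

    lexicon = {"terrify": {"polarity": "negative", "type": "strongsubj"}}
    print(word_features("terrifies", "terrify", "VB", "that terrifies me", lexicon))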
Where does ebonics fit in?
• Standardising ebonics terms in ‘free’ language may
improve accuracy of language processing tools e.g.
POS taggers
• Also benefit subsequent analysis e.g. sentiment
detection
• Possible application to:
– User-generated web content (e.g. in chat rooms)
– Email
– Speech
– Creative writing
Methodology - ebonics
• Check every token against the CELEX dictionary
– If not found, check the suffix
• attempt correction and re-check CELEX (no errors)
– If still not found, write to ebonics lexicon
in / in' → ing (runnin becomes running)
a / a' (word length > 2) → er (brotha becomes brother)
er → re (center becomes centre)
or → our (color becomes colour)
• Write dictionary translation by hand
– Include stress pattern for rhythm analysis
– Keep number of tokens and POS the same if possible
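A minimal sketch of the dictionary-check and suffix-correction pass described above. The small DICTIONARY set stands in for CELEX, and the rules are the ones listed on the slide:

    import re

    DICTIONARY = {"running", "brother", "centre", "colour", "because"}   # stand-in for CELEX
    SUFFIX_RULES = [(r"in'?$", "ing"), (r"a'?$", "er"), (r"er$", "re"), (r"or$", "our")]
    ebonics_lexicon = set()

    def standardise(token):
        t = token.lower()
        if t in DICTIONARY:
            return t
        for pattern, replacement in SUFFIX_RULES:
            if pattern.startswith("a") and len(t) <= 2:      # the a/a' rule needs length > 2
                continue
            candidate = re.sub(pattern, replacement, t)      # attempt a correction
            if candidate in DICTIONARY:
                return candidate                             # re-check against the dictionary
        ebonics_lexicon.add(t)                               # still unknown: hand-translate later
        return t

    print([standardise(w) for w in ["runnin", "brotha", "center", "color", "cuz"]])
    # ['running', 'brother', 'centre', 'colour', 'cuz'];  ebonics_lexicon == {'cuz'}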
Methodology - features
• Extract features: POS, language, sentiment, repetition
• POS tags
– Number of words for each UPenn tag
– Number of words in tag categories {NN, VB, JJ, RB, PRP}
• Language
– Number of regular abbreviations (e.g. don't)
– Number of frontal contractions (e.g. 'cause)
– Number of modal, passive and active verbs
– Average, minimum and maximum line lengths
– Language formality = % formal words (NN, JJ, IN, DT) versus informal words (PRP, RB, UH, VB)
– Word variety = type to token ratio
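For illustration, a sketch of the last two features computed from (token, POS tag) pairs; the tag sets below extend the slide's base tags with plural/inflected variants (NNS, VBP, VBZ), which is an assumption:

    FORMAL = {"NN", "NNS", "JJ", "IN", "DT"}
    INFORMAL = {"PRP", "RB", "UH", "VB", "VBP", "VBZ"}

    def language_features(tagged_tokens):
        formal = sum(1 for _, tag in tagged_tokens if tag in FORMAL)
        informal = sum(1 for _, tag in tagged_tokens if tag in INFORMAL)
        types = {word.lower() for word, _ in tagged_tokens}
        return {
            "formality": formal / max(formal + informal, 1),          # % formal vs informal
            "word_variety": len(types) / max(len(tagged_tokens), 1),  # type to token ratio
        }

    print(language_features([("I", "PRP"), ("wanna", "VBP"), ("dance", "VB"),
                             ("all", "DT"), ("night", "NN")]))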
Methodology - features
• Extract features: POS, language, sentiment, repetition
• Sentiment
– Number of words strong/weak positive
– Number of words strong/weak negative
• Repetition
– Number of phrase repetitions
– Phrase length
– Number of lines repeated in full
– Number of unique (non-repeated) lines
• Combination
– Combined features from above categories, edited to
remove correlations
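A sketch of the line-level repetition counts (one plausible reading of “repeated in full”: every occurrence of a line whose text appears more than once):

    from collections import Counter

    def repetition_features(lines):
        counts = Counter(line.strip().lower() for line in lines if line.strip())
        return {
            "lines_repeated_in_full": sum(n for n in counts.values() if n > 1),
            "unique_lines": sum(1 for n in counts.values() if n == 1),
        }

    print(repetition_features(["Hey ya", "Shake it", "Shake it", "Hey ya", "Alright"]))
    # {'lines_repeated_in_full': 4, 'unique_lines': 1}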
Methodology - corpus size
• Subsequent analyses use neural networks, which generally need large amounts of data
• But building the ebonics lexicon is labour-intensive
• Want to observe the impact of a reduced data set
• Standardise ebonics for approximately half the data (1037 songs out of 2392)
Ebonics and POS
• Stability within major tag categories
– Most noun shifts are NN to NNP or NNPS to NNP
– High frequency of named entities
– “be my Yoko Ono”, “served like Smuckers jam”
• Some shift from NN to VBG/VBZ tags
– “what's happ'ning”
• Increase in comparative adjectives
– “a doper flow”
• Ebonics does not appear to be important for POS
tagging unless text is from traditional ebonics source
Results
Ebonics and POS
• Most instances of ebonics don’t affect POS tagging
– “grinders” meaning sandwiches
– “deco umbrella” meaning average uterus
• Some ebonics affects tag counts
– “sloppy joe” meaning hamburger
• Phrasal tagging errors can be caused by surprising
CELEX entries
– “shama lama ding dong dip dip dip”: all but two words appear in CELEX
– Q: Which two? A: ding and dong
Ebonics and POS
• Most slang does not affect POS tagging
• Phrasal slang does affect tagging - one to many
relation
– “doing hunny in a sixty-five” meaning speeding
(recklessly, on an American freeway)
Ebonics and features
• Sentiment features responded most to ebonics edit
– Before edit average 0-15 positive words, 1-13 negative
– After edit average 0-28 positive words, 0-21 negative
• Other features virtually no effect after edit
• Ebonics terms likely to be charged with sentiment
• Text may benefit from ebonics editing prior to
sentiment analysis
Results
Data distribution
• Doubling data did not substantially affect data
distribution
• Effect on outliers - more data means more variety
• Ebonics edit makes data more consistent
– Effect on machine learning
– More sensitive to small changes
– May be losing useful information
Conclusions
• Ebonics does not appear to be important for POS
tagging unless text is from traditional ebonics source
– How do we identify these texts?
• Text may benefit from ebonics editing prior to
sentiment analysis
– Should we be watching for grammar changes too?
– Should we finesse sentiment or language analysis to
improve results?
Thank you.
Questions?
The bigger picture
• Discovering the music genome: using heterogeneous lyric features to classify music
Why?
• Automatically fill music databases
• Automate online playlists / radio
• Retrieve relevant music in multimedia search
• Better interface for your iPod
• Might adding text-based information to existing data
mining applications make them better?
Knowledge discovery
• The non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data.
– Support predictions or classification
– Explain existing data, not just describe it
Knowledge discovery
methodology
Self-organising maps
    som_show(M, 'umat', 'all', 'comp', 'all');    % U-matrix plus all component planes
    som_show(M, 'empty', 'Labels');               % empty grid to hold labels
    M = som_autolabel(M, D);                      % label map units from the data
    som_show_add('label', M);                     % add the labels to the plot
    som_show(M, 'color', {p{i}, [int2str(i), ' clusters']});   % colour units by cluster partition
Gold data (prior to cleaning)
Distance matrix + hits
Distance matrix
Language
[SOM maps: all data / ebonics off, half data / ebonics on, half data]
The data problem
[SOM maps A, B, C (language features) - A: ebonics off, full data; B: ebonics off, half data; C: ebonics on, half data]
Sentiment
[SOM maps A, B, C (sentiment features) - A: ebonics off, full data; B: ebonics off, half data; C: ebonics on, half data]
Repetition
[SOM maps A, B, C (repetition features) - A: ebonics off, full data; B: ebonics off, half data; C: ebonics on, half data]
Non-acoustic: all
[SOM maps A, B, C (all non-acoustic features) - A: ebonics off, full data; B: ebonics off, half data; C: ebonics on, half data]
Ambiguity
1. Word features (example instance: “terrifies”)
• Word token: terrifies
• Word part-of-speech: VB
• Context: that terrifies me
• Prior polarity: negative
• Reliability: strongsubj
2. Modification features (binary features from the dependency parse tree; example: “The human rights report poses a substantial challenge …”)
• Preceded by an adjective
• Preceded by an adverb (other than not)
• Preceded by an intensifier
• Self intensifier
• Modifies a strongsubj clue
• Modifies a weaksubj clue
• Modified by a strongsubj clue
• Modified by a weaksubj clue
3. Structure features (binary features from the parse tree; example: “The human rights report poses a substantial challenge …”)
• In subject (“The human rights report poses …”)
• In copular (“I am confident”)
• In passive voice (“must be regarded”)
4. Sentence features
• Count of strongsubj clues in the previous, current and next sentence
• Count of weaksubj clues in the previous, current and next sentence
• Counts of various parts of speech
5. Document feature
• Document topic (15 topics), e.g. economics, health, …, Kyoto protocol, presidential election in Zimbabwe
• Example: “The disease can be contracted if a person is bitten by a certain tick or if a person comes into contact with the blood of a congo fever sufferer.”