A log-linear model for unsupervised text normalization

Yi Yang and Jacob Eisenstein
Georgia Institute of Technology
Lexical variation

- Social media language is different from standard English.

  Tweet:    ya ur website suxx brah
  Standard: yeah, your website sucks bro

- Word variants: phonetic substitutions, abbreviations, unconventional spellings, etc.
- Normalization: map from orthographic variants to standard spellings.
  - Large label space.
  - No labeled data.
Why normalization?

- Improve downstream NLP applications:
  - Part-of-speech tagging [6]
  - Dependency parsing [11]
  - Machine translation [7, 8]
- Is there more systematic variation in social media?
  - Mine patterns in how words are spelled.
  - Discover coherent orthographic styles.
Related work

| Method             | Surface                        | Context                            | Final system                       |
|--------------------|--------------------------------|------------------------------------|------------------------------------|
| Han & Baldwin 2011 | edit distance, LCS             | distributional similarity, syntax  | linear combination, language model |
| Liu et al. 2012    | character-based translation    |                                    | decoding with language model       |
| Hassan et al. 2013 | edit distance, LCS             | random walk                        | decoding with language model       |
| Our approach       | edit distance, LCS, word pairs | language model                     | unified model with features        |
Our approach

- Unsupervised
- Featurized
- Context-driven
- Joint
Characteristics: Unsupervised

- Labeled data for Twitter normalization is limited.
- Language changes; annotations may become stale and ill-suited to new spellings and words.

[Figures: term frequency of "af" and of "lml" over months]
Characteristics: Featurized

Permitting overlapping features:
- String similarity features.
- Lexical features that memorize conventionalized word pairs, e.g. you/u, to/2.
Characteristics: Context-driven

  give me suttin to believe in

- Unsupervised normalization needs strong cues from the local context.
- String similarity must be overcome by contextual preference:
  - Edit-Dist(button/suiting, suttin) = 2
  - Edit-Dist(something, suttin) = 5
Characteristics: Joint

  gimme suttin 2 beleive innnn

- No words are in the standard vocabulary.
- Joint inference is needed to take advantage of context information.
Normalization in a probabilistic model

Maximize the likelihood of the observed (Twitter) text, marginalizing over normalizations:

$P(s) = \sum_t P(s, t)$

$P(\text{ya ur website suxx brah})
 = P(\text{ya ur website} \ldots, \text{yam urn website} \ldots)
 + P(\text{ya ur website} \ldots, \text{yak your website} \ldots)
 + P(\text{ya ur website} \ldots, \text{yea cure website} \ldots)
 + \ldots$
A noisy channel model

$P(s, t) = P_t(t)\, P_e(s \mid t)$,
e.g. $P_e(\text{ya ur} \ldots \mid \text{yam urn} \ldots) \times P_t(\text{yam urn website} \ldots)$

- The language model $P_t$ can be estimated offline.
- The emission model is locally normalized and log-linear:

$P_e(s \mid t) = \prod_n P_e(s_n \mid t_n), \qquad
 P_e(s_n \mid t_n) = \frac{\exp\left(\theta^\top f(s_n, t_n)\right)}{Z(t_n)}$

e.g. $P_e(\text{ya} \mid \text{yam})\, P_e(\text{ur} \mid \text{urn}) \ldots$, with
$P_e(\text{ur} \mid \text{your}) = \frac{\exp\left(\theta^\top f(\text{ur}, \text{your})\right)}{Z(\text{your})}$
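To make the locally normalized emission concrete, here is a minimal sketch (not the authors' implementation); `feature_fn`, `theta`, and `source_vocab` are assumed stand-ins for the feature function, weight vector, and source vocabulary described on this poster.

```python
import math

def emission_prob(s, t, theta, feature_fn, source_vocab):
    """P_e(s | t) = exp(theta . f(s, t)) / Z(t), normalized over all source strings s'."""
    def score(src, tgt):
        feats = feature_fn(src, tgt)
        return math.exp(sum(theta.get(name, 0.0) * value for name, value in feats.items()))

    z = sum(score(s_prime, t) for s_prime in source_vocab)  # partition function Z(t): O(V_S)
    return score(s, t) / z
```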
Features

String similarity
- Combine edit distance and longest-common-subsequence (LCS) similarity [3].
- Bin the scores to create binary features, e.g.,

$\text{top5}(s, t) = \begin{cases} 1, & t \in \text{Top5}(s) \\ 0, & \text{otherwise} \end{cases}$

Word pairs
- Assign weights to commonly occurring word pairs, e.g. you/u, sucks/suxx.
- This allows the model to memorize frequent substitutions.
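A minimal sketch of how such overlapping features could be computed; the way edit distance and LCS are combined into a ranking score, and the binning threshold, are illustrative assumptions rather than the exact recipe of [3].

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def lcs_len(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [0] * (len(b) + 1)
    for ca in a:
        prev = 0
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], (prev + 1 if ca == cb else max(dp[j], dp[j - 1]))
    return dp[-1]

def features(s, t, target_vocab, top_k=5):
    """Overlapping binary features for a (source, target) pair."""
    # string-similarity bin: is t among the top-k candidates under a simple combined score?
    sim = lambda cand: lcs_len(s, cand) - edit_distance(s, cand)
    top = sorted(target_vocab, key=sim, reverse=True)[:top_k]
    feats = {"top%d" % top_k: 1.0 if t in top else 0.0}
    # word-pair feature: lets the model memorize conventionalized substitutions like you/u
    feats["pair=%s/%s" % (s, t)] = 1.0
    return feats
```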
Learning

Our goal is to learn weights for these features. We compute the gradient of the likelihood.

CRF:
$\ell(\theta) = \log P(y \mid x), \qquad
 \frac{\partial \ell}{\partial \theta} = f(x, y) - \mathbb{E}_{y \mid x; \theta}[f(x, y)]$

Our model (MRF):
$\ell(\theta) = \log P(s), \qquad
 \frac{\partial \ell}{\partial \theta} = \mathbb{E}_{t \mid s}\big[ f(s, t) - \mathbb{E}_{s' \mid t}[f(s', t)] \big]$
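A short derivation sketch (my gloss, using only the model definitions above) of where the nested expectation comes from: differentiating the marginal likelihood yields an expectation over t, and since the language model does not depend on θ, only the locally normalized log-linear emission factors contribute their standard gradient.

$$
\frac{\partial \ell}{\partial \theta}
= \frac{\partial}{\partial \theta} \log \sum_t P(s, t)
= \mathbb{E}_{t \mid s}\Big[\frac{\partial}{\partial \theta} \log P(s, t)\Big]
= \mathbb{E}_{t \mid s}\Big[\sum_n \frac{\partial}{\partial \theta} \log P_e(s_n \mid t_n)\Big]
= \mathbb{E}_{t \mid s}\Big[\sum_n \big(f(s_n, t_n) - \mathbb{E}_{s' \mid t_n}[f(s', t_n)]\big)\Big]
$$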
Dynamic programming

In a locally normalized model, these expectations can be computed from the marginals $P(t_n \mid s_{1:N})$ [1].

[Candidate lattice, with start/end symbol ◦:
 ya → {yam, you, yeah, ...}, ur → {urn, your, you're, ...}, website → {website, ...},
 suxx → {sucks, suck, six, ...}, brah → {bra, brat, bro, ...}]

- Outer expectation $\mathbb{E}_{t \mid s}$: forward-backward.
- Inner expectation $\mathbb{E}_{s' \mid t}$: sum over all possible $s_n$ at each $n$.
- But the time complexity is $O(N V_T^2 + N V_T V_S)$, with $V_T, V_S > 10^4$.
Sequential Monte Carlo

A randomized algorithm approximates $P(t \mid s)$ as a weighted sum:

$P(t \mid s) \approx \sum_k \omega^{(k)} \delta_{t^{(k)}}(t)$

$P(t \mid \text{ya ur website} \ldots) \approx
 0.2 \times \delta(\text{you your website} \ldots)
 + 0.6 \times \delta(\text{yeah your website} \ldots)
 + \ldots
 + 0.1 \times \delta(\text{you you're website} \ldots)$

- Efficient and simple.
- Easy to parallelize.
- The number of samples $K$ provides intuitive tuning between accuracy and speed.
Sequential Importance Sampling (SIS)

At each $n$, for each sample $k$:

- Draw $t_n^{(k)}$ from the proposal distribution $Q(t)$.
- Update $\omega_n^{(k)}$ so that

$\omega_n^{(k)} \propto \frac{P(t_{1:n}^{(k)} \mid s_{1:n})}{Q(t_{1:n}^{(k)})}
 = \cdots
 = \omega_{n-1}^{(k)} \,
   \frac{\overbrace{P_e(s_n \mid t_n^{(k)})}^{\text{emission model}}\;
         \overbrace{P_t(t_n^{(k)} \mid t_{n-1}^{(k)})}^{\text{language model}}}
        {\underbrace{Q(t_n^{(k)} \mid t_{n-1}^{(k)}, s_n)}_{\text{proposal distribution}}}$
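A minimal sketch of this recursion for one sentence; `proposal_sample`, `proposal_prob`, `emission_prob`, and `lm_prob` are assumed stand-ins for the proposal, emission, and language-model components defined above.

```python
def sis_sample(source, K, proposal_sample, proposal_prob, emission_prob, lm_prob):
    """Sequential importance sampling over normalization sequences.

    Returns K sampled target sequences and their normalized importance weights.
    proposal_sample(prev_t, s)  -> a candidate t_n drawn from Q(. | prev_t, s)
    proposal_prob(t, prev_t, s) -> Q(t | prev_t, s)
    emission_prob(s, t)         -> P_e(s | t)
    lm_prob(t, prev_t)          -> P_t(t | prev_t)
    """
    samples = [["<s>"] for _ in range(K)]
    weights = [1.0] * K
    for s_n in source:
        for k in range(K):
            prev_t = samples[k][-1]
            t_n = proposal_sample(prev_t, s_n)
            # incremental weight: emission * language model / proposal
            weights[k] *= (emission_prob(s_n, t_n) * lm_prob(t_n, prev_t)
                           / proposal_prob(t_n, prev_t, s_n))
            samples[k].append(t_n)
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]  # normalize so the weights sum to one
    return [seq[1:] for seq in samples], weights
```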
Computing the gradient from SIS

- Outer expectation:

$\mathbb{E}_{t \mid s}[f(s, t)] \approx
 \sum_{k=1}^{K} \omega_N^{(k)} \sum_{n=1}^{N} f(s_n, t_n^{(k)})$

- Inner expectation: draw $L$ samples of $s'_n \mid t_n$:

$\mathbb{E}_{t \mid s}\big[\mathbb{E}_{s' \mid t}[f(s', t)]\big] \approx
 \sum_{k=1}^{K} \omega_N^{(k)} \sum_{n=1}^{N} \frac{1}{L} \sum_{\ell=1}^{L} f(s_n^{(\ell, k)}, t_n^{(k)})$

(In practice, we set $K = 10$ and $L = 1$.)
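Continuing in the same sketch style, the approximate gradient can be assembled from the weighted samples; `sample_source(t)` (drawing s' from P_e(· | t)) and `feature_fn` are assumed stand-ins.

```python
from collections import defaultdict

def approximate_gradient(source, samples, weights, feature_fn, sample_source, L=1):
    """Monte Carlo estimate of E_{t|s}[ f(s,t) - E_{s'|t}[f(s',t)] ]."""
    grad = defaultdict(float)
    for seq, w in zip(samples, weights):
        for s_n, t_n in zip(source, seq):
            # outer term: observed features under this sampled normalization
            for name, value in feature_fn(s_n, t_n).items():
                grad[name] += w * value
            # inner term: expected features under the emission model,
            # approximated with L samples of s' drawn from P_e(. | t_n)
            for _ in range(L):
                s_prime = sample_source(t_n)
                for name, value in feature_fn(s_prime, t_n).items():
                    grad[name] -= w * value / L
    return dict(grad)
```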
Example

Processing "ya ur website ..." for one sample $k$ (start symbol ◦, candidate lattice as above):

| $s_n$   | $t_n^{(k)}$ | $\omega_n^{(k)}$ | $s_n^{(\ell)}$ |
|---------|-------------|------------------|----------------|
| ya      | yeah        | $P(\text{yeah}, \text{ya} \mid \circ) \,/\, Q(\text{yeah} \mid \circ, \text{ya})$ | yea |
| ur      | your        | $\omega_1^{(k)} \cdot P(\text{your}, \text{ur} \mid \text{yeah}) \,/\, Q(\text{your} \mid \text{yeah}, \text{ur})$ | youu |
| website | website     | $\omega_2^{(k)} \cdot P(\text{website}, \text{website} \mid \text{your}) \,/\, Q(\text{website} \mid \text{your}, \text{website})$ | website |
| ...     | ...         | ...              | ... |

- Sample $t_n^{(k)}$ according to $Q(t \mid \circ, \text{ya})$, then $Q(t \mid \text{yeah}, \text{ur})$, and so on.
- Sample $s_n^{(\ell)}$ according to $P_e(s \mid t_n^{(k)})$, e.g. $P_e(s \mid \text{yeah})$.
- Gradient update:
  $\omega_N^{(k)} \big( f(\text{ya}, \text{yeah}) - f(\text{yea}, \text{yeah})
   + f(\text{ur}, \text{your}) - f(\text{youu}, \text{your}) + \ldots \big)$
The proposal distribution

A proposal distribution can guide sampling toward the high-probability region.

1. $Q(t_n \mid s_n, t_{n-1}) = P(t_n \mid s_n, t_{n-1}) \propto P_e(s_n \mid t_n)\, P_t(t_n \mid t_{n-1})$
   - Accurate (optimal!)
   - Not fast: computing $P_e(s_n \mid t_n)$ costs $O(V_T V_S)$.

2. Our proposal:
   $Q(t_n \mid s_n, t_{n-1}) \propto P_e(s_n \mid t_n)\, Z(t_n)\, P_t(t_n \mid t_{n-1})
    = \exp\left(\theta^\top f(s_n, t_n)\right) P_t(t_n \mid t_{n-1})$
   - Fast: $O(V_S + V_T)$ to compute $\omega_n^{(k)}$.
   - Fairly accurate: biased by a factor of $Z(t_n)$.
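A sketch of how this proposal might be implemented and why the weight update is cheap; the candidate lists and component functions are assumptions. Because the unnormalized proposal shares the factor exp(θᵀf(s_n, t_n)) P_t(t_n | t_{n-1}) with the model, the incremental weight collapses to the ratio of the proposal normalizer C_n (an O(V_T) sum) and the emission normalizer Z(t_n) (an O(V_S) sum).

```python
import math
import random

def propose_and_weight(s_n, prev_t, theta, feature_fn, lm_prob, target_cands, source_vocab):
    """Sample t_n from Q(t | s_n, prev_t) proportional to exp(theta.f(s_n, t)) * P_t(t | prev_t),
    and return the incremental importance weight, which simplifies to C_n / Z(t_n)."""
    def score(src, tgt):
        return math.exp(sum(theta.get(k, 0.0) * v for k, v in feature_fn(src, tgt).items()))

    # unnormalized proposal scores over the V_T target candidates: O(V_T)
    q_scores = [score(s_n, t) * lm_prob(t, prev_t) for t in target_cands]
    c_n = sum(q_scores)
    t_n = random.choices(target_cands, weights=q_scores, k=1)[0]

    # emission partition function Z(t_n), summed over the V_S source candidates: O(V_S)
    z_t = sum(score(s_prime, t_n) for s_prime in source_vocab)

    # P_e(s_n|t_n) * P_t(t_n|prev_t) / Q(t_n|prev_t, s_n) = C_n / Z(t_n)
    return t_n, c_n / z_t
```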
Implementation details

unLOL: unsupervised normalization in a LOg-Linear model

- Decoding: propose $K$ sequences $t^{(k)}$, then perform Viterbi decoding within this limited candidate set.
- Normalization targets:
  - Training: all alphanumeric strings not in aspell.
  - Test: normalization targets are given (standard in this task).
- Language model: Kneser-Ney smoothed trigram model, estimated from tweets with no OOV words.
- Training data: for each OOV word in the test set, draw 50 tweets from the Edinburgh corpus [10] and the Twitter API.
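A sketch of the restricted Viterbi step as I read it: the lattice at each position contains only the targets proposed by the K samples. For brevity it uses a bigram approximation of the language model (the actual system uses a trigram) and assumes all probabilities are nonzero.

```python
import math

def viterbi_decode(source, sampled_seqs, emission_prob, lm_prob):
    """Viterbi decoding restricted to the targets proposed by the K sampled sequences."""
    N = len(source)
    # candidate set at each position = targets proposed by any of the K samples
    cands = [sorted({seq[n] for seq in sampled_seqs}) for n in range(N)]

    # delta[t]: best log-score of any partial path ending in target t
    delta = {t: math.log(lm_prob(t, "<s>")) + math.log(emission_prob(source[0], t))
             for t in cands[0]}
    back = []
    for n in range(1, N):
        new_delta, ptrs = {}, {}
        for t in cands[n]:
            best_prev = max(cands[n - 1], key=lambda p: delta[p] + math.log(lm_prob(t, p)))
            new_delta[t] = (delta[best_prev] + math.log(lm_prob(t, best_prev))
                            + math.log(emission_prob(source[n], t)))
            ptrs[t] = best_prev
        back.append(ptrs)
        delta = new_delta

    # follow back-pointers from the best final target
    best = max(delta, key=delta.get)
    path = [best]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))
```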
Evaluation

Datasets
- LWWL11 [9]: 3802 isolated words.
- LexNorm1.1 [5]: 549 tweets with 1184 nonstandard tokens.
- LexNorm1.2: 172 manual corrections to LexNorm1.1 (www.cc.gatech.edu/~jeisenst/lexnorm.v1.2.tgz).
Results

Adding L1 regularization

[Figure: F-measure vs. number of features (0 to 400,000) for L1 regularization strengths λ ∈ {0, 1e-06, 5e-06, 1e-05, 5e-05, 1e-04}. F-measure stays around 82-83 for λ ≤ 5e-06 while the feature count shrinks, then degrades toward 79-80 as λ grows to 1e-04.]
Analysis

Can normalization help identify orthographic styles?

- Use unLOL to automatically label lots of tweets.
- Use Levenshtein alignment between the original and normalized tweets to find character substitution patterns, e.g., ng/n_.
- Use matrix factorization to find sets of patterns that are used by the same set of authors (see the sketch below).

[Pipeline: tweets → unLOL → normalized tweets → aligner → character patterns (e.g., t/_, ng/n_) → author-pattern matrix → NMF → style dictionaries + author loadings]
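A sketch of how the alignment and factorization stages could be wired together; it uses plain character-level Levenshtein alignment (single-character patterns only, whereas the patterns above also record a neighboring character) and scikit-learn's NMF, so it illustrates the pipeline rather than the authors' exact setup.

```python
import numpy as np
from sklearn.decomposition import NMF

def substitution_patterns(original, normalized):
    """Character patterns (normalized_char/original_char, '_' for indels, as in ng/n_)
    from a Levenshtein alignment of an (original, normalized) word pair."""
    n, m = len(original), len(normalized)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (original[i - 1] != normalized[j - 1]))
    # trace back, recording the mismatched character pairs
    patterns, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + (original[i - 1] != normalized[j - 1])):
            if original[i - 1] != normalized[j - 1]:
                patterns.append("%s/%s" % (normalized[j - 1], original[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            patterns.append("_/%s" % original[i - 1])   # extra character in the tweet
            i -= 1
        else:
            patterns.append("%s/_" % normalized[j - 1])  # character dropped in the tweet
            j -= 1
    return patterns

def style_factorization(author_pattern_counts, pattern_vocab, n_styles=10):
    """Factorize the author-pattern count matrix into style dictionaries and author loadings."""
    X = np.array([[counts.get(p, 0.0) for p in pattern_vocab]
                  for counts in author_pattern_counts])
    model = NMF(n_components=n_styles, init="nndsvda", max_iter=500)
    loadings = model.fit_transform(X)   # authors x styles
    dictionaries = model.components_    # styles x patterns
    return loadings, dictionaries
```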
Observations

Some styles mirror phonological variables [4]: g-dropping, (TH)-stopping, t-deletion, ...

| style       | rules                  | examples                                                   |
|-------------|------------------------|------------------------------------------------------------|
| g-dropping  | g*/_*, ng/n_, g/_      | goin, talkin, watchin, feelin, makin                       |
| t-deletion  | t*/_*, st/s_, t/_, h/_ | jus, bc, shh, wha, gota, wea, mus, firts, jes, subsistutes |
| th-stopping | *t/*d, th/d_, t/d      | dat, de, skool, fone, dese, dha, shid, dhat, dat's         |
Observations

Others are known netspeak phenomena, like expressive lengthening [2].

| style       | rules                  | examples                                                                        |
|-------------|------------------------|---------------------------------------------------------------------------------|
| (kd)-adding | i_/id, _/k, _/d, _*/k* | idk, fuckk, okk, backk, workk, badd, andd, goodd, bedd, elidgible, pidgeon      |
| o-adding    | o_/oo, _*/o*, _/o      | soo, noo, doo, oohh, loove, thoo, helloo                                        |
| e-adding    | _/i, e_/ee, _/e, _*/e* | mee, ive, retweet, bestie, lovee, nicee, heey, likee, iphone, homie, ii, damnit |
Observations

We get meaningful styles even from mistaken normalizations! (e.g., i'm/ima, out/outta)

| style    | rules                  | examples                                                                |
|----------|------------------------|-------------------------------------------------------------------------|
| a-adding | _/a, __/ma, _/m, _*/a* | ima, outta, needa, shoulda, woulda, mm, comming, tomm, boutt, ppreciate |
Summary

- unLOL: joint, unsupervised, and log-linear.
  - Features balance between edit distance and conventionalized word pairs.
  - Main challenge: a massive label space, in which quadratic algorithms are not practical.
  - Solution: approximate the gradient with SMC.
  - The proposal distribution focuses the samples on the high-likelihood part of the search space.
- Normalization enables the study of systematic orthographic variation.
References

[1] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein. Painless unsupervised learning with features. In Proceedings of NAACL, pages 582-590, 2010.
[2] S. Brody and N. Diakopoulos. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!: Using word lengthening to detect sentiment in microblogs. In Proceedings of EMNLP, 2011.
[3] D. Contractor, T. A. Faruquie, and L. V. Subramaniam. Unsupervised cleansing of noisy text. In Proceedings of COLING, pages 189-196, 2010.
[4] J. Eisenstein. Phonological factors in social media writing. In Proceedings of the NAACL Workshop on Language Analysis in Social Media, 2013.
[5] B. Han and T. Baldwin. Lexical normalisation of short text messages: makn sens a #twitter. In Proceedings of ACL, pages 368-378, 2011.
[6] B. Han, P. Cook, and T. Baldwin. Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology, 4(1):5, 2013.
[7] H. Hassan and A. Menezes. Social text normalization using contextual graph random walks. In Proceedings of ACL, 2013.
[8] W. Ling, C. Dyer, I. Trancoso, and A. Black. Paraphrasing 4 microblog normalization. In Proceedings of the 2013 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Seattle, USA, October 2013. Association for Computational Linguistics.
[9] F. Liu, F. Weng, B. Wang, and Y. Liu. Insertion, deletion, or substitution?: Normalizing text messages without pre-categorization nor supervision. In Proceedings of ACL, pages 71-76, 2011.
[10] S. Petrović, M. Osborne, and V. Lavrenko. The Edinburgh Twitter corpus. In Proceedings of the NAACL HLT Workshop on Computational Linguistics in a World of Social Media, pages 25-26, 2010.
[11] C. Zhang, T. Baldwin, H. Ho, B. Kimelfeld, and Y. Li. Adaptive parser-centric text normalization. In Proceedings of ACL, pages 1159-1168, 2013.