A log-linear model for unsupervised text normalization
Yi Yang and Jacob Eisenstein
Georgia Institute of Technology

Lexical variation
- Social media language is different from standard English.
    Tweet:    ya ur website suxx brah
    Standard: yeah, your website sucks bro
- Word variants arise from phonetic substitutions, abbreviations, unconventional spellings, etc.
- Normalization: map orthographic variants to their standard spellings.
- Challenges: a large label space, and no labeled data.

Why normalization?
- Improve downstream NLP applications:
  - Part-of-speech tagging [6]
  - Dependency parsing [11]
  - Machine translation [7, 8]
- Is there more systematic variation in social media?
  - Mine patterns in how words are spelled.
  - Discover coherent orthographic styles.

Related work
Method               Surface                           Context                        Final system
Han & Baldwin 2011   edit distance, LCS                language model, syntax         linear combination
Liu et al. 2012      character-based translation       distributional similarity,     decoding
                                                       language model
Hassan et al. 2013   edit distance, LCS                random walk, language model    decoding
Our approach         edit distance, LCS, word pairs    language model                 unified model with features

Our approach
- Unsupervised
- Featurized
- Context-driven
- Joint

Characteristics: Unsupervised
- Labeled data for Twitter normalization is limited.
- Language changes, so annotations may become stale and ill-suited to new spellings and words.
[Figure: term frequency of "af" and of "lml" over months]

Characteristics: Featurized (permitting overlapping features)
- String similarity features.
- Lexical features that memorize conventionalized word pairs, e.g., you/u, to/2.

Characteristics: Context-driven
    give me suttin to believe in
- Unsupervised normalization needs the strong cue of local context.
- Contextual preference must be able to overcome string similarity:
  Edit-Dist(button, suttin) = Edit-Dist(suiting, suttin) = 2, but Edit-Dist(something, suttin) = 5.

Characteristics: Joint
    gimme suttin 2 beleive innnn
- None of these words is in the standard vocabulary.
- Joint inference is needed to take advantage of context information.

Normalization in a probabilistic model
Maximize the likelihood of the observed (Twitter) text, marginalizing over normalizations:
    P(s) = Σ_t P(s, t)
    P(ya ur website suxx brah) = P(ya ur website ..., yam urn website ...)
                               + P(ya ur website ..., yak your website ...)
                               + P(ya ur website ..., yea cure website ...)
                               + ...

A noisy channel model
    P(s, t) = P_t(t) P_e(s | t)
    e.g., P_e(ya ur ... | yam urn ...) × P_t(yam urn website ...)
- The language model P_t can be estimated offline.
- The emission model is locally normalized and log-linear:
    P_e(s | t) = Π_n P_e(s_n | t_n)
    P_e(s_n | t_n) = exp(θᵀ f(s_n, t_n)) / Z(t_n)
    e.g., P_e(ya | yam) P_e(ur | urn) ..., with P_e(ur | your) = exp(θᵀ f(ur, your)) / Z(your)

Features
String similarity
- Combine edit distance and longest common subsequence (LCS) [3].
- Bin the resulting scores to create binary features, e.g.,
    top5(s, t) = 1 if t ∈ Top5(s), and 0 otherwise.
Word pairs
- Assign weights to commonly occurring word pairs, e.g., you/u, sucks/suxx.
- This allows the model to memorize frequent substitutions.
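As a concrete illustration of the emission model and feature set, here is a minimal sketch: a locally normalized log-linear P_e(s_n | t_n) over toy vocabularies, with a binned string-similarity feature and a word-pair feature. The vocabularies, feature names, and weights are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict

# Toy vocabularies; in the real model the source vocabulary is all observed
# Twitter word types and the target vocabulary is the standard lexicon.
TARGET_VOCAB = {"your", "you", "year", "website"}
SOURCE_VOCAB = {"ur", "your", "you", "yr", "website"}

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming (single-row)."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

def features(s, t):
    """Overlapping binary features for an (observed s, target t) pair:
    binned string similarity plus a lexical word-pair feature."""
    f = defaultdict(float)
    ranked = sorted(TARGET_VOCAB, key=lambda c: edit_distance(s, c))
    if t in ranked[:1]:
        f["top1"] = 1.0
    if t in ranked[:5]:
        f["top5"] = 1.0
    f["pair=%s/%s" % (t, s)] = 1.0  # can memorize conventionalized pairs, e.g. your/ur
    return f

def emission_prob(s, t, theta):
    """P_e(s | t) = exp(theta . f(s, t)) / Z(t). The normalizer Z(t) sums over
    the source vocabulary; that sum is what makes exact inference expensive."""
    def score(s_prime):
        return math.exp(sum(theta.get(k, 0.0) * v
                            for k, v in features(s_prime, t).items()))
    return score(s) / sum(score(sp) for sp in SOURCE_VOCAB)

# Hand-set illustrative weights.
theta = {"top1": 1.5, "top5": 0.5, "pair=your/ur": 2.0}
print(round(emission_prob("ur", "your", theta), 3))
```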
Learning
Our goal is to learn the feature weights θ; we compute the gradient of the likelihood.

CRF (supervised):
    ℓ(θ) = log P(y | x)
    ∂ℓ/∂θ = f(x, y) − E_{y|x;θ}[f(x, y)]

Our model (MRF, unsupervised):
    ℓ(θ) = log P(s)
    ∂ℓ/∂θ = E_{t|s}[ f(s, t) − E_{s'|t}[f(s', t)] ]

Dynamic programming
In a locally normalized model, these expectations can be computed from the marginals P(t_n | s_{1:N}) [1].

    s:  ◦   ya     ur      website   suxx    brah   ◦
    t:      yam    urn     website   sucks   bra
            you    your              suck    brat
            yeah   you're            six     bro
            ...    ...     ...       ...     ...

- Outer expectation E_{t|s}: forward-backward.
- Inner expectation E_{s'|t}: sum over all possible s_n at each position n.
- But the time complexity is O(N V_T^2 + N V_T^2 V_S), with V_T, V_S > 10^4 ...

Sequential Monte Carlo
A randomized algorithm to approximate P(t | s) as a weighted sum of samples:
    P(t | s) ≈ Σ_k ω^(k) δ_{t^(k)}(t)
    P(t | ya ur website ...) ≈ 0.2 × δ(you your website ...)
                             + 0.6 × δ(yeah your website ...)
                             + 0.1 × δ(you you're website ...)
                             + ...
- Efficient and simple.
- Easy to parallelize.
- The number of samples K provides an intuitive trade-off between accuracy and speed.

Sequential Importance Sampling (SIS)
At each position n, for each sample k:
- Draw t_n^(k) from the proposal distribution Q(t).
- Update ω_n^(k) so that
    ω_n^(k) ∝ P(t_{1:n}^(k) | s_{1:n}) / Q(t_{1:n}^(k))
            = ω_{n−1}^(k) · [ P_e(s_n | t_n^(k)) · P_t(t_n^(k) | t_{n−1}^(k)) ] / Q(t_n^(k) | t_{n−1}^(k), s_n)
  where the numerator is the emission model times the language model, and the denominator is the proposal distribution.
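A minimal sketch of the SIS recursion on a toy lattice. The candidate sets, the uniform proposal, and the stand-in emission and language models are assumptions for illustration; only the weight update mirrors the recursion above.

```python
import random

# Toy lattice: candidate standard words for each observed token.
OBSERVED = ["ya", "ur", "website"]
CANDIDATES = {
    "ya": ["yeah", "you", "yam"],
    "ur": ["your", "you're", "urn"],
    "website": ["website"],
}

def lm(t, t_prev):
    """Stand-in bigram language model P_t(t | t_prev); uniform here."""
    return 1.0 / len(set(w for cands in CANDIDATES.values() for w in cands))

def emission(s, t):
    """Stand-in emission P_e(s | t); favors targets sharing the first letter."""
    return 0.6 if t[0] == s[0] else 0.2

def proposal(s, t_prev):
    """Proposal Q(t | t_prev, s); uniform over the candidate set in this sketch."""
    cands = CANDIDATES[s]
    return {t: 1.0 / len(cands) for t in cands}

def sis(num_samples=10, seed=0):
    """Sequential importance sampling: returns K weighted target sequences."""
    rng = random.Random(seed)
    samples = [{"t": [], "w": 1.0} for _ in range(num_samples)]
    for s in OBSERVED:
        for smp in samples:
            t_prev = smp["t"][-1] if smp["t"] else "<s>"
            q = proposal(s, t_prev)
            t = rng.choices(list(q), weights=list(q.values()))[0]
            # Weight recursion: w_n = w_{n-1} * P_e(s|t) * P_t(t|t_prev) / Q(t|t_prev, s)
            smp["w"] *= emission(s, t) * lm(t, t_prev) / q[t]
            smp["t"].append(t)
    total = sum(smp["w"] for smp in samples)
    for smp in samples:
        smp["w"] /= total  # normalize so the weights sum to one
    return samples

for smp in sis():
    print(round(smp["w"], 3), " ".join(smp["t"]))
```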
Computing the gradient from SIS
- Outer expectation:
    E_{t|s}[f(s, t)] ≈ Σ_{k=1}^{K} ω_N^(k) Σ_{n=1}^{N} f(s_n, t_n^(k))
- Inner expectation: draw L samples of s'_n | t_n:
    E_{t|s}[ E_{s'|t}[f(s', t)] ] ≈ Σ_{k=1}^{K} ω_N^(k) Σ_{n=1}^{N} (1/L) Σ_{ℓ=1}^{L} f(s_n^(ℓ,k), t_n^(k))
(In practice, we set K = 10 and L = 1.)

Example
    s_n:       ya                       ur                          website
    t_n^(k):   yam / you / yeah / ...   urn / your / you're / ...   website / ...

For one sample k:
- Sample t_1^(k) according to Q(t | ◦, ya), say yeah; set ω_1^(k) = P(yeah, ya | ◦) / Q(yeah | ◦, ya).
- Sample s_1^(ℓ) according to P_e(s | yeah), say yea.
- Sample t_2^(k), say your; update ω_2^(k) = ω_1^(k) · P(your, ur | yeah) / Q(your | yeah, ur); sample s_2^(ℓ), say youu.
- Sample t_3^(k) = website; update ω_3^(k) = ω_2^(k) · P(website, website | your) / Q(website | your, website); sample s_3^(ℓ) = website.
- ...
- Gradient update: ω_N^(k) · ( f(yeah, ya) − f(yeah, yea) + f(your, ur) − f(your, youu) + ... )

The proposal distribution
A good proposal distribution guides sampling so that it focuses on the high-probability region.
1. Q(t_n | s_n, t_{n−1}) = P(t_n | s_n, t_{n−1}) ∝ P_e(s_n | t_n) P_t(t_n | t_{n−1})
   - Accurate (optimal!)
   - Not fast: computing P_e(s_n | t_n) costs O(V_T V_S).
2. Our proposal:
    Q(t_n | s_n, t_{n−1}) ∝ P_e(s_n | t_n) Z(t_n) P_t(t_n | t_{n−1}) = exp(θᵀ f(s_n, t_n)) P_t(t_n | t_{n−1})
   - Fast: O(V_S + V_T) to compute ω_n^(k).
   - Fairly accurate: biased by a factor of Z(t_n).
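The difference between the two proposals can be sketched directly. In the toy code below, the single feature, the weights, and the vocabularies are assumptions; the point is only that the fast proposal drops the normalizer Z(t_n), which is the expensive sum over the source vocabulary.

```python
import math

TARGETS = ["your", "you're", "urn", "year"]       # toy target vocabulary V_T
SOURCES = ["ur", "yr", "your", "youre", "urn"]    # toy source vocabulary V_S

def score(s, t, theta):
    """Unnormalized log-linear emission score exp(theta . f(s, t)).
    The single feature here (shared first character) is purely illustrative."""
    return math.exp(theta.get("first_char", 0.0) * (s[0] == t[0]))

def lm(t, t_prev):
    """Stand-in bigram language model P_t(t | t_prev); uniform here."""
    return 1.0 / len(TARGETS)

def optimal_proposal(s, t_prev, theta):
    """Q = P(t | s, t_prev) ∝ P_e(s | t) P_t(t | t_prev).
    Needs Z(t) = sum over SOURCES for every target: O(V_T * V_S) work."""
    weights = {}
    for t in TARGETS:
        z_t = sum(score(sp, t, theta) for sp in SOURCES)
        weights[t] = (score(s, t, theta) / z_t) * lm(t, t_prev)
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

def fast_proposal(s, t_prev, theta):
    """Q ∝ exp(theta . f(s, t)) P_t(t | t_prev): drops Z(t), so only O(V_T) work
    per position; the remaining O(V_S) cost appears once, in the weight update."""
    weights = {t: score(s, t, theta) * lm(t, t_prev) for t in TARGETS}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

theta = {"first_char": 2.0}
print("optimal:", optimal_proposal("ur", "<s>", theta))
print("fast:   ", fast_proposal("ur", "<s>", theta))
```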
Implementation details
unLOL: unsupervised normalization in a LOg-Linear model
- Decoding: propose K sequences t^(k), then run Viterbi within this limited set.
- Normalization targets:
  - Training: all alphanumeric strings not in aspell.
  - Test: normalization targets are given (standard in this task).
- Language model: Kneser-Ney smoothed trigram model, estimated from tweets with no OOV words.
- Training data: for each OOV word in the test set, draw 50 tweets from the Edinburgh corpus [10] and the Twitter API.

Evaluation
Datasets:
- LWWL11 [9]: 3802 isolated words.
- LexNorm1.1 [5]: 549 tweets with 1184 nonstandard tokens.
- LexNorm1.2: 172 manual corrections to LexNorm1.1 (www.cc.gatech.edu/~jeisenst/lexnorm.v1.2.tgz).

Results
[Figure: adding L1 regularization. F-measure (79-83) vs. number of features (0-400,000) for λ ∈ {0, 1e-06, 5e-06, 1e-05, 5e-05, 1e-04}.]

Analysis
Can normalization help identify orthographic styles?
- Use unLOL to automatically label lots of tweets.
- Use Levenshtein alignment between the original and normalized tweets to find character substitution patterns, e.g., ng/n_ (a sketch of this step follows below).
- Use matrix factorization to find sets of patterns that are used by the same set of authors.

Pipeline: tweets → unLOL → normalized tweets → aligner → character patterns (e.g., t/_, ng/n_) → author-pattern matrix → NMF → style dictionaries and author loadings.
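Here is a minimal sketch of the aligner step, under two stated assumptions: the (tweet word, normalized word) pairs are made up rather than real unLOL output, and a generic difflib character alignment stands in for the Levenshtein aligner used in the pipeline above.

```python
from collections import Counter
from difflib import SequenceMatcher

def substitution_patterns(tweet_word, standard_word):
    """Character-level substitution patterns, written standard-side/observed-side
    to match the ng/n_ convention above (e.g. g/_ for talkin vs. talking);
    '_' marks an empty side."""
    patterns = []
    sm = SequenceMatcher(None, tweet_word, standard_word, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            continue
        observed = tweet_word[i1:i2] or "_"
        standard = standard_word[j1:j2] or "_"
        patterns.append("%s/%s" % (standard, observed))
    return patterns

# Made-up (tweet word, normalized word) pairs standing in for unLOL output.
pairs = [("talkin", "talking"), ("goin", "going"),
         ("jus", "just"), ("dat", "that"), ("soo", "so")]

counts = Counter(p for o, n in pairs for p in substitution_patterns(o, n))
print(counts.most_common())
# Per-author counts of such patterns give the author-pattern matrix, which can
# then be factored (e.g., with non-negative matrix factorization) into styles.
```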
Observations
Some styles mirror phonological variables: g-dropping, (TH)-stopping, t-deletion, ... [4]

style         rules                          examples
g-dropping    g*/_*   ng/n_   g/_            goin, talkin, watchin, feelin, makin
t-deletion    t*/_*   st/s_   t/_            jus, bc, shh, wha, gota, wea, mus, firts, jes, subsistutes
th-stopping   h/_   *t/*d   th/d_   t/d      dat, de, skool, fone, dese, dha, shid, dhat, dat's

Others are known netspeak phenomena, like expressive lengthening [2]:

style         rules                          examples
(kd)-adding   i_/id   _/k   _/d   _*/k*      idk, fuckk, okk, backk, workk, badd, andd, goodd, bedd, elidgible, pidgeon
o-adding      o_/oo   _*/o*   _/o            soo, noo, doo, oohh, loove, thoo, helloo
e-adding      _/i   e_/ee   _/e   _*/e*      mee, ive, retweet, bestie, lovee, nicee, heey, likee, iphone, homie, ii, damnit

We get meaningful styles even from mistaken normalizations (e.g., i'm/ima, out/outta):

style         rules                          examples
a-adding      _/a   __/ma   _/m   _*/a*      ima, outta, needa, shoulda, woulda, mm, comming, tomm, boutt, ppreciate

Summary
- unLOL: joint, unsupervised, and log-linear.
- Features balance edit distance against conventionalized word pairs.
- Main challenge: a massive label space, in which quadratic algorithms are not practical.
- Solution: approximate the gradient with SMC; the proposal distribution focuses the samples on the high-likelihood part of the search space.
- Normalization enables the study of systematic orthographic variation.

References
[1] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein. Painless unsupervised learning with features. In Proceedings of NAACL, pages 582–590, 2010.
[2] S. Brody and N. Diakopoulos. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!: Using word lengthening to detect sentiment in microblogs. In Proceedings of EMNLP, 2011.
[3] D. Contractor, T. A. Faruquie, and L. V. Subramaniam. Unsupervised cleansing of noisy text. In Proceedings of COLING, pages 189–196, 2010.
[4] J. Eisenstein. Phonological factors in social media writing. In Proceedings of the NAACL Workshop on Language Analysis in Social Media, 2013.
[5] B. Han and T. Baldwin. Lexical normalisation of short text messages: makn sens a #twitter. In Proceedings of ACL, pages 368–378, 2011.
[6] B. Han, P. Cook, and T. Baldwin. Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology, 4(1):5, 2013.
[7] H. Hassan and A. Menezes. Social text normalization using contextual graph random walks. In Proceedings of ACL, 2013.
[8] W. Ling, C. Dyer, I. Trancoso, and A. Black. Paraphrasing 4 microblog normalization. In Proceedings of the 2013 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Seattle, USA, October 2013.
[9] F. Liu, F. Weng, B. Wang, and Y. Liu. Insertion, deletion, or substitution? Normalizing text messages without pre-categorization nor supervision. In Proceedings of ACL, pages 71–76, 2011.
[10] S. Petrović, M. Osborne, and V. Lavrenko. The Edinburgh Twitter corpus. In Proceedings of the NAACL HLT Workshop on Computational Linguistics in a World of Social Media, pages 25–26, 2010.
[11] C. Zhang, T. Baldwin, H. Ho, B. Kimelfeld, and Y. Li. Adaptive parser-centric text normalization. In Proceedings of ACL, pages 1159–1168, 2013.