Indo-Australia Workshop on Optimization in Human Language Technology
16th Dec 2012, IIT Patna

Language Change as a Constrained Multi-Objective Optimization
Monojit Choudhury
Microsoft Research Lab, India
monojitc@microsoft.com

Language Change
• Change in the syntactic/semantic/phonological features of a language
• Perpetual, universal, directional (?)
• Phonological Change:
  – Affects the sounds
  – Structured, independent of syntax/semantics
  – Example: Loss of consonant clusters in Hindi: agni → aag, dugdha → dUdh, raatri → raat

Effects of the “Lazy Tongue”
Assimilation
• in+apt = inapt
• in+decent = indecent
• in+polite = impolite
• in+mature = immature
• in+legal = illegal
• in+regular = irregular
Deletion
• cannot → can’t
• do not → don’t
• will not → won’t
• are not → ain’t
• information → info

Explanations for Change
Exogenous causes
  – Language contact
  – Socio-political factors
  – Communication medium
Endogenous causes
  – Functional
  – Phonetic error-based
  – Frequency drifts
  – Evolutionary

Functional Explanation of Language Change
• There are three evolutionary forces on any linguistic system:
  – Minimization of effort (energy)
  – Maximization of perceptual distinctiveness (minimization of ambiguity)
  – Maximization of learnability
Language is a perpetually evolving system shaped by these three conflicting forces.

Outline of the Talk
• Morpho-phonological change of Bangla verb systems and emergence of dialect diversity
  – Approach: Multi-Objective Constrained Optimization
  – Technique: Multi-Objective Genetic Algorithm (MOGA)
• Understanding Computer Mediated Communication
  – Normalization of Texting language
  – Romanization of Indian Language text

Geography of Bangla
• Standard Colloquial Bengali (SCB)
• Agartala Colloquial Bengali (ACB)
• Sylhetti

History of Bangla
[Timeline figure: 1200 AD, 1800 AD]

Bangla Verb Morphology
করেছিলাম = kar-echh-il-aam ("I had done")
• kar: verb root (do)
• echh: aspect (perfect)
• il: tense (past)
• aam: person (first)

Cognates in the Dialects
Features      Classical      SCB         ACB
Non-finite    kariyA         kore        kairA
Ps,2, per.    kariyAChila    koreChilo   korsilo
Ps,1, cont.   kariteChilAm   korChilAm   kartAslAm
root: kar (to do)

Atomic Phonological Operators
Deletion, Metathesis, Assimilation, Mutation
kariteChila
  → Del(e/t_Ch) → karitChila
  → Del(t/_Ch) → kariChila
  → Met(ri/_Ch) → kairChila
  → Asm(a→o/_i) → korChila
  → Mut(a→o/_$) → korChilo

Hypothesis
A sequence of Atomic Phonological Operators is preferred if the verb forms obtained by applying the sequence to the classical forms have some functional benefit over the classical forms.
Thus, all the modern dialects of Bangla have some functional advantage over the classical dialect.

A Formal Model of Functional Explanation
[Figure: objective space with axes f1 (effort of articulation) and f2 ([acoustic distinctiveness]^-1); regions labeled unstable languages, metastable languages and impossible languages]

Genetic Algorithm
• GA: search for good solutions by mimicking nature [recombination and mutation of genes]
• Gene: a string of symbols
• Phenotype: what the solution actually looks like

Phenotype
Lexicon consisting of 28 forms for the verb kar
kori      kori
korChi    kartAsi
:         :
korte     kartA

Genotype
A sequence of atomic phonological operators
Del t | Met ri | NOP | Del e | Asm a | Del i | NOP
Dsm e | NOP | NOP | Met ri | Asm a | Del e | NOP

Genotype → Phenotype
Genotype: Del t | Met ri | NOP | Del e | Asm a | Del i | NOP
Successive forms shown on the slide:
kari    kariteChi   karite
kari    karieChi    karie
kair    kaireChi    kaire
kor     korCh       kor

Crossover
Mutation

Multi-Objective GA
• Apply constraints
• Find the good solutions
• But also keep some not-so-good solutions
• Repeat for several iterations

Objective functions
• Articulatory effort
  – fe(Λ): weighted sum of number of syllables, letters and vowel height differences, averaged over all words in the lexicon
• Acoustic Distinctiveness
  – fd(Λ): Inverse of mean edit distance between words
• Learnability
  – fr(Λ): correlation between feature match and edit distance

Experiments
• NSGA-II: a package for fast MOGA
• Gene length: 15 APOs
• A repertoire of 128 APOs
• Population: 1000, Generations: 500
• 6 models with different combinations of constraints and objectives

Pareto-optimal front
[Figure: Pareto-optimal front with the positions of SCB, ACB, Sylhetti and CB marked]

Observations
• The front has a vertical and a horizontal limb
• The real dialects lie on the horizontal limb
• Sound changes push the dialects from right to left (reduce effort)
• but never up the limb
• Why?

Role of Constraints

For more information
Choudhury et al., Evolution, optimization and language change: the case of Bengali verb inflections, in Proceedings of ACL SIGMORPHON9, Association for Computational Linguistics, 2007
http://research.microsoft.com/people/monojitc/
MOGA and NSGA-II: Kanpur Genetic Algorithms Laboratory
http://www.iitk.ac.in/kangal/index.shtml

Food for Thought
• Evaluation:
  – Myriads of possible dialects, but only a few observed in nature
• Fixed set of pre-defined APOs
  – How to generalize for any change?
• MOGA is an optimization tool, which in no way simulates language change
  – How do languages optimize themselves?

Computer Mediated Communication
Form

Texting Language
• A new genre of English & also other languages, used in chats, sms, emails, blogs, tweets, FB posts, comments etc.
• Ungrammatical, unconventional spellings
• The shorter the faster
• Constraint: understandability
dis is n eg 4 txtin lang (24 characters)
This is an example for Texting language (39 characters)

Analysis of Social Media
• A hot topic in NLP
  – Normalization
  – Language identification
  – Sentiment/Polarity detection
  – Summarization/trend prediction

Choudhury et al. (2007) Investigation and Modeling of the Structure of Texting Language. In IJCAI Workshop on Analytics of Noisy Data 2007

Tomorrow never dies!!!
• 2moro (9)
• tomoz (25)
• tomoro (12)
• tomrw (5)
• tom (2)
• tomra (2)
• tomorrow (24)
• tomora (4)
• tomm (1)
• tomo (3)
• tomorow (3)
• 2mro (2)
• morrow (1)
• tomor (2)
• tmorro (1)
• moro (1)

Patterns or Compression Operators
• Phonetic substitution (phoneme)
  – psycho → syco, then → den
• Phonetic substitution (syllable)
  – today → 2day, see → c
• Deletion of vowels
  – message → mssg, about → abt
• Deletion of repeated characters
  – tomorrow → tomorow
• Truncation (deletion of tails)
  – introduction → intro, evaluation → eval
• Common Abbreviations
  – Bangalore → blr, text back → tb
• Informal pronunciation
  – going to → gonna, better → betta

HMMs for SMS Normalization
[Figure: word HMM for "today" with start state S0, grapheme states G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’, phoneme states P2 /AH/ and P4 /AY/, a state S1 for “2”, and end state S6, plus ε and @ labels]

Bigram Examples
• TL: would b gd 2 c u some time soon
• Op: would be good to see you some time soon
• TL: just wanted 2 say a big thanx 4 my bday card
• Op: just wanted to say a big thanks for my today card
• TL: me wel i fink bein at home makes me feel a lot more stressed den bein away from it
• Op: me well i think being at home makes me feel a lot more stressed deny being away from it

Use of Indian Languages on Online Social Media
• Transliteration
• Spelling Change
• Code mixing
• Indian English

Concluding Remarks
• Languages are perpetually evolving and optimizing systems
  – Computational modeling of language change is still in its infancy
  – Lots of scope for research

Thank You!
monojitc@microsoft.com
Questions??
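
Sketch: compression operators as string functions
Some of the compression operators listed for texting language (deletion of vowels, deletion of repeated characters, truncation, phonetic substitution) can be approximated with a few string functions. The following is only a minimal Python sketch and not the model from the talk: the function names, the rule that the first letter is never deleted, the truncation length and the substitution table are all assumptions chosen so that the slide examples are reproduced.

```python
import re


def delete_noninitial_vowels(word: str) -> str:
    """Deletion of vowels, keeping the first letter (assumed rule)."""
    if not word:
        return word
    return word[0] + re.sub(r"[aeiou]", "", word[1:])


def collapse_repeats(word: str) -> str:
    """Deletion of repeated characters: 'rr' becomes 'r'."""
    return re.sub(r"(.)\1+", r"\1", word)


def truncate(word: str, keep: int) -> str:
    """Truncation (deletion of tails); 'keep' is a hypothetical parameter."""
    return word[:keep]


def substitute(word: str, table: dict) -> str:
    """Phonetic substitution from a hand-written, illustrative table."""
    for source, target in table.items():
        word = word.replace(source, target)
    return word


print(delete_noninitial_vowels("message"))  # message -> mssg
print(delete_noninitial_vowels("about"))    # about -> abt
print(collapse_repeats("tomorrow"))         # tomorrow -> tomorow
print(truncate("introduction", 5))          # introduction -> intro
print(substitute("today", {"to": "2"}))     # today -> 2day
print(substitute("then", {"th": "d"}))      # then -> den
```

Real texting forms combine several operators at once (for example 2moro), which is why the talk models normalization statistically rather than with fixed rules.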
Why Computational Models?
FOR                       AGAINST
Exploration               Toy languages
Virtual experimentation   Simplified assumptions
Formalization             Intractable
Can we model real world language change?

Objectives and Constraints - 1
• Articulatory effort
  $f_e(w) = \alpha_1 f_{e1}(w) + \alpha_2 f_{e2}(w) + \alpha_3 f_{e3}(w)$
  $f_{e1}(w) = |w|$
  $f_{e2}(w) = \sum_i hr(\sigma_i)$
  $f_{e3}(w) = \sum_i |ht(V_i) - ht(V_{i+1})|$

Objectives and Constraints - 2
• Acoustic distinctiveness
  $f_d(\Lambda) = \frac{1}{N} \sum_{i<j} ed(w_i, w_j)^{-1}$
  $C_d(\Lambda) = -1$ if $ed(w_i, w_j) = 0$ for more than 2 pairs
• Phonotactic constraints
  $C_p(\Lambda) = -1$ if any of the words violates the phonotactic constraints of the language

Objectives and Constraints - 3
• Learnability as Regularity
  – $f_r$: the correlation coefficient between the edit distance and the number of matching morphological attributes for every word pair
  – $C_r = -1$ if $f_r > 0.8$

Emergent dialects
Classical      D1        D2                  D3
kariteChilAm   kartA     karChi (korChi)     karteChi (kartAsi)
kariteChila    kartAa    karCha (korCha)     karteCha (kartAsa)
kariteChilen   kartAen   karChen (korChen)   karteChen (kartAsen)
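
Sketch: computing the distinctiveness objective and constraint
The acoustic distinctiveness objective fd and the constraint Cd from the "Objectives and Constraints - 2" slide can be illustrated on a small lexicon. This is a hedged Python sketch under stated assumptions, not the code used in the experiments: ed is taken to be unit-cost Levenshtein distance over letters, N is taken to be the number of word pairs, identical pairs are left out of the sum (they are what Cd penalizes), and a satisfied constraint is assumed to return 0.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (assumed form of ed)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def f_d(lexicon: list) -> float:
    """Mean inverse edit distance over word pairs (lower is better)."""
    dists = [edit_distance(a, b) for a, b in combinations(lexicon, 2)]
    if not dists:
        return 0.0
    # Assumption: pairs with distance 0 are skipped here and left to c_d.
    return sum(1.0 / d for d in dists if d > 0) / len(dists)


def c_d(lexicon: list, max_identical_pairs: int = 2) -> int:
    """-1 if more than two word pairs are identical, else 0 (assumed)."""
    identical = sum(1 for a, b in combinations(lexicon, 2) if a == b)
    return -1 if identical > max_identical_pairs else 0


# Three forms taken from the D2 column of the "Emergent dialects" slide.
d2_forms = ["karChi", "karCha", "karChen"]
print(f_d(d2_forms), c_d(d2_forms))
```

A lexicon whose forms have all collapsed into one another, such as three copies of kor, violates the constraint: c_d(["kor", "kor", "kor"]) returns -1 under these assumptions.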