NLP Research in the Internet Age
An Overview of NLP at Microsoft Research Asia

Ming Zhou
Manager of Natural Language Group
Microsoft Research Asia

Trends of Internet Services
• Ecosystems that work with third-party apps
  – Apple Apps, Facebook, Twitter, Baidu, Sina, QQ
• Real-time content collection and search
  – Twitter, Facebook, Del.icio.us, NYT, YouTube
• Mobile search
  – Contextual intent understanding
  – Towards decision making and action taking
• Social power
  – Social tags (like) for general search engines
  – Search engines in SNS
  – Social QA

Impact and Challenge to NLP Research
• Impact
  – Biggest database ever: connects data
  – Biggest social network: connects people
  – Harnessing collective intelligence
  – Contextual information processing: user, user's social network, location, time
  – Real-time information processing: collection, indexing, and operation without delay
• Challenge
  – How to leverage data, people, and contextual information to achieve real-time information processing?

Problems of Traditional NLP Approaches (NLP 1.0)
• Deep in individual component technologies, which are reaching their upper bounds
• Little consideration of scenarios, user needs, and market needs
• Serious data sparseness when relying on human annotation
• Evaluation bottleneck
• Slow deployment
• No effective framework for incorporating user feedback

New Strategy of NLP (NLP 2.0)
• Data collection from the web
• Domain-specific and open IE
• Contextual NLP
• Optimize at the system level, not at the individual component level
• Earlier deployment on the Internet
• Make the best use of social factors

Our Vision and Task
Understand users and documents in any language, for any device and any application
• Advanced NLP technologies
  – Word breaker, POS tagging, chunking, syntactic parser, semantic role labeling, speller, query suggestion, summarization
  – Chinese, Japanese, English
• Multi-language information access
  – Statistical machine translation
  – Multi-language search
• Semantic computing
  – Sentiment analysis, event extraction, ontology learning
  – Understanding query intent and document
  – Contextual NLP

MSRA NLP Research Overview
[Diagram: layered overview of MSRA NLP research]
• Applications: Chinese IME, English writing wizard, news search, comparison shopping, Japanese IME, pocket translator, Twitter search, chatbot, query speller, couplet generation, resume routing, general web search, text analysis, machine translation
• Component technologies: skeleton parser, translation evaluation, named entity identification, translation knowledge acquisition, POS tagging, information extraction, information retrieval, metadata extraction, paraphrasing, term extraction, web mining for MT, annotation tool, SMT, machine learning, vertical search, cross-language IR, NLP-enriched indexing, SLM
• Research areas: NLP (C, J, E); MT (C, J, E); IR and IE (C, J, E); text mining and search
• Data: MRD, parsing lexicon, bilingual corpus, balanced corpus, tagged corpus, query-doc relevance, translation lexicon, bilingual tagged corpus

Research Accomplishment
• Awards
  – MSRA Best Research Team (2010)
  – Finalist of WSJ Asian Innovation Awards (2010)
  – MS ARD Best Project (Engkoo)
  – MSRA Best Innovation (1998-2008): IME and Chinese couplets
• Academic impact
  – Best result in NIST 2008 SMT, CWMT 2008, and CWMT 2009
  – Best result in the SIGHAN 2006 bake-off on Chinese word segmentation
  – Best result in cross-language information retrieval in TREC-9 and NTCIR-III
  – 40 ACL papers, 9 SIGIR papers, 17 Coling papers (2000-2010)
  – PC chair and area chair of ACL
• Collaboration with universities
  – HIT joint lab on NLP, Speech and Search; Tsinghua joint lab on Media and Network
  – 400 interns in 12 years
  – Summer schools since 2001
  – PhD supervisors at universities

Summer School on Information Extraction (Harbin, June 2005)
• Cheng Niu: Information extraction
• Frank Seide: Speech information extraction and search
• Hwee Tou Ng: Advanced topics of information extraction
• Chin-Yew Lin: Information extraction for automatic summarization

Projects based on NLP 2.0
• Engkoo: Web-based English learning service
  – Data mining from the web
• Chinese couplets
  – Include user's power into system
    evolution
• Semantic analysis and search of microblogging
  – Move to SNS, mobile

Engkoo
Parallel data mining from the web
Video: http://video.sina.com.cn/v/b/37417609-1286528122.html

Rapidly Changing Language
• Approximately 1.5 billion people speak English as a primary, secondary, or business language
• China: the largest "English-speaking" country, with 250 million English learners and USD 60 billion in annual expenses
• Problem: a living language, with new words and new meanings
Key insight: With billions of translated web pages and sharable repositories of language data growing every day, the Internet holds the sum of human language knowledge
www.engkoo.com

Major Features
• Endless lexicon with native definitions
• Human-like TTS and phonetic search
• State-of-the-art machine translation (NIST OpenMT winner)
• Real-time interactive alignment
Microsoft products: Bing, Office, MSN

[Screenshot slides: Massive Dictionary Mined from the Web; Fresh and Diverse Examples; Advanced Search with Sentence Analysis; Sentence Classification; Learn Contextual Usage with Word Alignment]

Hints of Easy-Confused Words
1. Word's idiomatic usage
  • Verb~Noun (decline~offer)
  • Verb~Adv (greatly~improve)
  • Adj~Noun (arduous~task)
  • Adv~Adj (extremely~bad)
2. Paraphrasing
  • turn_on~light, switch_on~light
  • laborious~task, hard~task
  • deeply~moved, deeply~touched
3. Collocation translations
  • 订~计划, make~plan
  • 订~旅馆, book~room
  • 订~杂志, subscribe to~magazine

Knowledge Mining Pipeline
[Pipeline diagram. Stages: Web Mining, Linguistic Parsing, Knowledge Mining, Multilevel Indexing. Data: Mined Data, Parsed Data, Linguistic Knowledge, Indexed Data. Models: Machine Translation Model, Paraphrasing Model]
Parallel sentence:
  He could hardly afford to waste that golden time.
  他无法浪费那样的好时光。
Tokenizing:
  he could hardly afford to waste that golden time.
  他 无法 浪费 那样 的 好 时光。
Skeleton parsing:
  (Tsub~he~afford) (ModAdv~hardly~afford) (Tobj~waste~afford) (Tobj~time~waste) (AdjAttrib~golden~time)
  (Tsub~他~浪费) (ModAdv~无法~浪费) (Tobj~浪费~时光) (AdjAttrib~好~时光)
Alignment:
  he(他) could hardly afford to(无法) waste(浪费) that(那样的) golden(好) time(时光)
Multilevel indexing:
1. Single word
  "he", "could", "hardly", "afford" etc.
  "他", "无法", "浪费" etc.
2. Single word with its POS
  "he_Pron", "could_Verb", "hardly_Adv" etc.
  "他_Pron", "无法_Adv", "浪费_Verb" etc.
3. Collocation
  "Tsub~he~afford", "Tobj~time~waste" etc.
  "Tsub~他~浪费", "ModAdv~无法~浪费" etc.

Chinese Couplets
Include user's power into system evolution

Chinese Couplets (http://duilian.msra.cn)
Video: http://video.sina.com.cn/v/b/10937201-1452530713.html

FS and SS Share the Same Style
(FS: first sentence of a couplet; SS: second sentence)
Repetition of pronunciations (音韵联)
  风 (wind) ---- 水 (water)
  吹 (blow) ---- 使 (make)
  荞 (buckwheat) ---- 舟 (ship)
  动 (wave) ---- 流 (go)
  桥 (bridge) ---- 洲 (island)
  未 (not) ---- 不 (not)
  动 (wave) ---- 流 (go)

FS and SS Share the Same Style
Decomposition of characters (拆字联)
  有 (have) ---- 缺 (lack)
  子 (son) ---- 鱼 (fish)
  有 (have) ---- 缺 (lack)
  女 (daughter) ---- 羊 (mutton)
  方 (so) ---- 敢 (dare)
  称 (call) ---- 叫 (call)
  好 (good) ---- 鲜 (fresh)
  好 = 女 + 子; 鲜 = 鱼 + 羊

FS and SS Share the Same Style
Person name (人名联) and palindrome (回文联)
  板桥 (Banqiao) ---- 东坡 (Dongpo)
  造 (produce) ---- 居 (live)
  桥 (bridge) ---- 坡 (slope)
  板 (board) ---- 东 (east)
• Banqiao (板桥) and Dongpo (东坡) are famous literary figures
• Each sentence reads the same from top to bottom as from bottom to top

SS Generation Process
FS: 海阔凭鱼跃 (sea wide allow fish jump)
Translation candidates:
  海 (sea): 山 (hill), 天 (sky)
  阔 (wide): 高 (high), 深 (deep)
  海阔: 天高 (sky high), 山高 (hill high)
  凭 (allow): 任 (permit), 倚 (depend)
  鱼 (fish): 虫 (insect), 鸟 (bird), 虎 (tiger)
  跃 (jump): 飞 (fly), 舞 (dance), 鸣 (tweedle)
  鱼跃: 鸟飞 (bird fly), 虎啸 (tiger roar)
Candidate lists after each step:
1. SMT decoding: 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山高靠虎啸, 山高任虎啸, 山深任鸟飞, 天高任花香, ……
2. Linguistic filtering: 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山深任鸟飞, 天高任花香, 天高任鸟舞, 山高任花香, ……
3. Reranking: 天高任鸟飞, 山高任鸟飞, 天高任鸟鸣, 天高任鸟舞, 山深任鸟飞, 山高任花香, 天高任花香, ……

SS Generation Approach
• A multi-phase SMT approach
  – Phase 1: a phrase-based log-linear model
  – Phase 2: linguistic filters
  – Phase 3: a Ranking SVM
[Flow diagram: FS input → phrase-based log-linear model → N-best candidates → linguistic filters → Ranking SVM model → SS output]

Great Examples
• FS: 月落乌啼霜满天
  SS: 风吹雁过雨连宵
• FS: 千江有水千江月
  SS: 万里无云万里星
• FS: 秦淮河桨声灯影
  SS: 松花江水色月光
• FS: 此木为柴山山出 (此+木=柴; 山+山=出)
  SS: 白水作泉日日昌 (白+水=泉; 日+日=昌)

User Log for Model Enhancement
• Motivation
  – Training data is inadequate
  – The user log is large (60k/m), growing, and diverse
• What we log
  – User inputs
  – User-finalized couplets
    • Second sentences selected from the candidates provided by our system
    • User-modified second sentences

User Log Analysis
  Number of input sentences: 12,322
  Number of unique input sentences: 6,698
  Users directly selecting from system output: 3,459
  Users manually modifying system output: 606
  Saved as favorite couplets: 109
  Invalid user input: 618
  No second sentence generated: 2,211
  Banner generation: 2,687
  Generated banner selected as favorite: 428
  No banner output: 265
Data source: log from http://couplet.msra.cn
Time period: Aug. 31-Oct. 9, 2006

New Framework with Log Data
[Diagram. Main flow: first sentence input, source-channel model (translation model, language model), N-best candidates, re-ranking, second sentence output. Other elements shown: training data, log data, user operation, translation model, language model, mutual information]

Twitter Search
Move to social internet and mobile
[Diagram: tweet processing layers]
• A collection of tweets: tweet clustering, news and image link extraction, user influence measure, hot tag and topic extraction, popular tweet extraction, top video/music/artist extraction
• Individual tweet: semantic role labeling, NE recognition, sentence boundary detection, sentiment analysis, dependency parsing, text normalization, co-reference, classification
• Other elements shown: raw data, noise filtering, multi-level indexing, semantic search, community extraction, statistical relationship learning

Conclusion
• Internet trends and their impact on NLP
• NLP 2.0 strategy
• Web data mining: Engkoo
• User's power: Couplets
• SNS and mobile: Twitter search
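
Sketch: the multi-phase SS generation approach
The three phases named under "SS Generation Approach" (a phrase-based log-linear model, linguistic filters, and a Ranking SVM) can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the system described in the slides: the candidate characters come from the "SS Generation Process" example, but every probability, the bigram bonuses, the three filter rules, the reranking weights, and all function names are invented for illustration. Phase 3 uses a hand-weighted linear scorer as a stand-in for a trained Ranking SVM.

```python
import math
from itertools import product

# Toy translation table: first-sentence (FS) character -> {SS character: probability}.
# Candidate characters are taken from the slide; all probabilities are invented.
TRANSLATIONS = {
    "海": {"山": 0.5, "天": 0.5},
    "阔": {"高": 0.7, "深": 0.3},
    "凭": {"任": 0.8, "倚": 0.2},
    "鱼": {"鸟": 0.5, "虎": 0.3, "虫": 0.2},
    "跃": {"飞": 0.5, "舞": 0.3, "鸣": 0.2},
}
# Invented bigram bonuses standing in for a language model.
BIGRAM_BONUS = {("鸟", "飞"): 1.0, ("天", "高"): 0.5, ("山", "高"): 0.4}


def features(fs, ss):
    """Return (translation log-probability, language-model score) for a candidate."""
    tm = sum(math.log(TRANSLATIONS[f][s]) for f, s in zip(fs, ss))
    lm = sum(BIGRAM_BONUS.get(pair, 0.0) for pair in zip(ss, ss[1:]))
    return tm, lm


def linear_score(fs, ss, weights):
    """Weighted sum of features, i.e. an unnormalized log-linear score."""
    return sum(w * f for w, f in zip(weights, features(fs, ss)))


def phase1_nbest(fs, n=10, weights=(1.0, 1.0)):
    """Phase 1: enumerate candidates and keep the n best under the log-linear model."""
    candidates = ("".join(chars) for chars in product(*(TRANSLATIONS[ch] for ch in fs)))
    return sorted(candidates, key=lambda ss: -linear_score(fs, ss, weights))[:n]


def repetition_pattern(sentence):
    """Index of each character's first occurrence, e.g. 'ABA' -> (0, 1, 0)."""
    return tuple(sentence.index(ch) for ch in sentence)


def phase2_filter(fs, candidates):
    """Phase 2: drop candidates that break simple (assumed) couplet constraints."""
    return [
        ss for ss in candidates
        if len(ss) == len(fs)                                 # equal length
        and repetition_pattern(ss) == repetition_pattern(fs)  # same repetition pattern
        and not set(ss) & set(fs)                             # no character reused from FS
    ]


def phase3_rerank(fs, candidates, weights=(0.5, 2.0)):
    """Phase 3: rerank with a linear scorer (stand-in for a trained Ranking SVM)."""
    return sorted(candidates, key=lambda ss: -linear_score(fs, ss, weights))


def generate_ss(fs, n=10):
    """Run the three phases in order and return the reranked candidates."""
    return phase3_rerank(fs, phase2_filter(fs, phase1_nbest(fs, n)))
```

With these toy numbers, `generate_ss("海阔凭鱼跃")[0]` returns 天高任鸟飞, the same top candidate shown after reranking in the slide example. A real system would learn the Phase 1 feature weights and the Phase 3 ranking model from data, such as the user logs described above, rather than setting them by hand.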