Silviu Cucerzan
Researcher
Microsoft Corporation

[Figure: a browsed document and a search box with the query "heat shield". Terms from all previous interactions: NASA, Space station, Solar panels, Discovery, Space shuttle, Space lab, John Curry, Atmospheric reentry, Power system, Peter King, CBS News, Associated Press]

Real-time user intent distribution
[Figure: table of counts per intent category (movies, news, fan club, biography, gallery, official site, interviews, pictures, wallpaper, posters, imdb, quotes, screensavers); the individual cell values are not recoverable from the extraction]

[Figure: many query variants for the same question, e.g. "inventers of the bicycle", "invention of bicycle", "invention of the bicycle", "invention of the first bicycle", "inventions of the bicycle", "the invention of the bicycle", "when was the bicycle invented", together with analogous variants such as "airplane inventor", "inventor of the airplane", "invention of airplane", "printing press inventor", "inventor printing press", "light bulb inventor"]

AskMSR: who invented the bicycle? (Web, Encarta, Text)

Text document → Named entity recognizer → Bush
Surface form: mention of an entity referred to as “Bush”
Candidate entities: George W. Bush; George H. W. Bush; Bush, music band; Reggie Bush; …
The entity George W. Bush has the surface forms: George W. Bush, George Bush, President Bush, Bush, …

Text document → Named entity recognizer → Bush, Texas
Bush: George W. Bush; George H. W. Bush; Bush, music band; Reggie Bush; …
Texas: Texas, US state; Texas, pop band; Texas, novel; University of Texas at Austin; …

Challenges:
• Reference entities
• World knowledge

[Figure labels: Page Title; First Paragraph; Surface Form/Disambiguation]

E.g.: 385 references to Gwen Stefani in other Wikipedia articles, such as:
'Let Me Blow Ya Mind' by Eve and [[Gwen Stefani]] (whom he would produce …
In the video ''[[Cool (song)|Cool]]'', [[Gwen Stefani]] is made-up as Monroe. …
'[[South Side (song)|South Side]]' (featuring [[Gwen Stefani]]) #14 US …
[[1969]] - [[Gwen Stefani]], American singer ([[No Doubt]]) …
[[Rosie Gaines]], [[Carmen Electra]], [[Gwen Stefani]], [[Chuck D]], [[Angie Stone]], …
In late [[2004]], [[Gwen Stefani]] released a hit song called 'Rich Girl' which …
[[Gwen Stefani]] - lead singer of the band [[No Doubt]], who is now a successful …
[[Social Distortion]], and [[TSOL]]. [[Gwen Stefani]], lead vocalist of the [[alternative rock]] ...
main proponents (along with [[Gwen Stefani]] and [[Ashley Judd]]) in bringing back the …
The [[United States|American]] singer [[Gwen Stefani]] references Harajuku in several …
which also features vocals by [[No Doubt]]'s [[Gwen Stefani]]. The cover was included on …
co-written by [[Eric Stefani]] and [[Gwen Stefani]] and co-produced by Matthew Wilder …
…

Surface Forms → Entities
E.g.: the surface form "Texas" maps to ≈ 30 entities: Texas (TV Series), Texas (US State), University of Texas Austin, USS Texas, Texas (band), Texas (musical), Texas (novel), Texas (SpongeBob episode), Texas Instruments, Texas County, OK, ...
For the entity Texas (TV Series):
Tags: NBC network shows; American television soaps; Television spin-offs
Contexts: Another World; Pam Long; Paul Rauch; ...

Surface forms are collected from:
• the titles of entity pages
• the titles of redirecting pages (e.g., Another World in Texas)
• the disambiguation pages
• the references to entity pages in other Wikipedia articles
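The four Wikipedia sources of surface forms listed above can be illustrated with a small sketch. It is not the implementation behind the slides: the function names, the input data structures, and the rule that strips a trailing parenthetical or comma-separated appositive from a page title are all assumptions made for illustration. Only the example strings are taken from the slides.

```python
import re
from collections import defaultdict

# Matches [[Target]] and [[Target|anchor text]] wiki links.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def strip_qualifier(title):
    """Drop a trailing parenthetical or comma appositive (assumed rule)."""
    title = re.sub(r"\s*\([^)]*\)$", "", title)
    return title.split(",")[0].strip()


def build_surface_forms(entity_titles, redirects, disambiguations, article_texts):
    """Map each surface form to the set of entities it may refer to.

    All four arguments are hypothetical, pre-extracted Wikipedia data:
    a list of entity page titles, a dict redirect title -> target title,
    a dict ambiguous name -> list of target titles, and raw article texts.
    """
    known = set(entity_titles)
    forms = defaultdict(set)
    # Source 1: titles of entity pages, with and without the qualifier.
    for title in entity_titles:
        forms[title].add(title)
        forms[strip_qualifier(title)].add(title)
    # Source 2: titles of redirecting pages.
    for redirect_title, target in redirects.items():
        if target in known:
            forms[redirect_title].add(target)
    # Source 3: disambiguation pages.
    for ambiguous_name, targets in disambiguations.items():
        for target in targets:
            if target in known:
                forms[ambiguous_name].add(target)
    # Source 4: references (links) to entity pages in other articles.
    for text in article_texts:
        for target, anchor in WIKI_LINK.findall(text):
            if target in known:
                forms[anchor or target].add(target)
    return forms


if __name__ == "__main__":
    forms = build_surface_forms(
        entity_titles=["Texas (TV Series)", "Texas (band)", "Gwen Stefani", "Cool (song)"],
        redirects={"Another World in Texas": "Texas (TV Series)"},
        disambiguations={"Texas": ["Texas (TV Series)", "Texas (band)"]},
        article_texts=["In the video ''[[Cool (song)|Cool]]'', [[Gwen Stefani]] is made-up as Monroe."],
    )
    for surface_form in sorted(forms):
        print(surface_form, "->", sorted(forms[surface_form]))
```

The result is a one-to-many dictionary: an ambiguous surface form such as "Texas" collects several candidate entities, which is the input the disambiguation step needs.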
Tags:
• List pages (“List of [...]”, “Table of [...]”): 540,000 pairs
• Wikipedia categories: 2.65 million pairs
• Lexicosyntactic patterns: noisy (e.g., LEX_Scotland_Music_#1)

Contexts:
• Appositives and parentheticals in the titles, e.g.: Texas (TV Series); Texas, Queensland
• Entity references (links)

Document Analysis
• Truecasing
• Named Entity Recognition
• Coreference Resolution
• Disambiguation

Stage 1: Sentence boundary detection and truecasing (sentence beginnings and titles)
Stage 2: Structural ambiguity resolution
• Conjunctions (e.g., Barnes and Noble)
• Possessives (e.g., Britain’s Tony Blair)
• PP attachment (e.g., Whitney Museum in New York)
by using Wikipedia and Web statistics: for a candidate "T1 Particle T2", search engine query “T1” “T2”
Stage 3: 5-way named entity classification
Stage 4: Coreference resolution
• Shorter to longer forms, e.g., Brown/PERSON → Michael Brown/PERSON
• Acronyms

[Figure labels: resources used by the stages: Web and Corpus Stats; Regular Expressions; Gazetteers; Web Search Query Logs; Wikipedia Entities; CoNLL 2003 Statistics; Heuristics]

C = {c_1, …, c_M}: known contexts (all surface forms and appositives/parentheticals)
T = {t_1, …, t_N}: known category tags

[Figure: a text document D containing the surface forms s_1, …, s_n. Each surface form s_i has the candidate entities e_1^{s_i}, …, e_k^{s_i}, …, e_{|\mathcal{E}(s_i)|}^{s_i}, where \mathcal{E}(s_i) is the set of entities known for s_i; each candidate e_k^{s_i} has a context vector C_k^{s_i} and a tag vector T_k^{s_i}. The document context is d = D ∩ C.]

Maximize the similarity between the document context d and each entity’s contexts, as well as between the category tags of each entity pair.

Optimization problem:

\[
\arg\max_{(e_1,\dots,e_n)\,\in\,\mathcal{E}(s_1)\times\dots\times\mathcal{E}(s_n)}
\;\sum_{i=1}^{n} \langle C_{e_i}, d \rangle
\;+\; \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \langle T_{e_i}, T_{e_j} \rangle
\]

More robust and simpler:

\[
\bar{d} \;=\; d \;+\; \sum_{s \in S(D)} \; \sum_{e \in \mathcal{E}(s)} T_e
\]

\[
\arg\max_{(e_1,\dots,e_n)\,\in\,\mathcal{E}(s_1)\times\dots\times\mathcal{E}(s_n)}
\;\sum_{i=1}^{n} \big\langle (C_{e_i}, T_{e_i}),\; \bar{d} - (0, T_{e_i}) \big\rangle
\]

\[
\arg\max_{e_i \in \mathcal{E}(s_i)}
\;\big\langle (C_{e_i}, T_{e_i}),\; \bar{d} \big\rangle \;-\; \lVert T_{e_i} \rVert^{2},
\qquad i = 1..n
\]

where \lVert T_{e_i} \rVert^{2} is the number of category tags of e_i.

Development:
• Reference: Wikipedia ver. 04/02/2006
• 100 news stories from MSNBC

Evaluation:
• Reference: Wikipedia ver. 09/11/2006
• Test sets: Wikipedia data: 88.3%; News: 91.4%

© 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Microsoft Research Faculty Summit 2007
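As a supplement to the optimization slides, the "more robust and simpler" per-surface-form maximization can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's code: entities are represented by binary context and tag vectors stored as Python sets, the function name `disambiguate` and the data layout are hypothetical, and the toy contexts and tags are partly invented for the example.

```python
from collections import Counter


def disambiguate(doc_contexts, candidates):
    """Pick one entity per surface form.

    doc_contexts: mapping context -> number of occurrences in the document
                  (the document context vector d).
    candidates:   mapping surface form -> {entity: (contexts, tags)}, where
                  contexts and tags are sets (binary vectors C_e and T_e).
    """
    # Tag part of the extended document vector: the sum of the tag vectors
    # of every candidate entity of every surface form in the document.
    tag_totals = Counter()
    for entities in candidates.values():
        for _contexts, tags in entities.values():
            tag_totals.update(tags)

    chosen = {}
    for surface_form, entities in candidates.items():

        def score(entity):
            contexts, tags = entities[entity]
            # <C_e, d>: document occurrences of the entity's known contexts.
            context_score = sum(doc_contexts.get(c, 0) for c in contexts)
            # <T_e, summed tag vectors> - ||T_e||^2: remove the entity's
            # own contribution, which equals its number of tags.
            tag_score = sum(tag_totals[t] for t in tags) - len(tags)
            return context_score + tag_score

        chosen[surface_form] = max(entities, key=score)
    return chosen


if __name__ == "__main__":
    # Toy data (hypothetical): a document mentioning Pam Long and Paul Rauch.
    d = Counter({"Pam Long": 1, "Paul Rauch": 1})
    candidates = {
        "Texas": {
            "Texas (TV Series)": (
                {"Another World", "Pam Long", "Paul Rauch"},
                {"NBC network shows", "American television soaps", "Television spin-offs"},
            ),
            "Texas (band)": ({"Glasgow"}, {"Scottish musical groups"}),
        },
        "Another World": {
            "Another World (TV series)": (
                {"Pam Long"},
                {"NBC network shows", "American television soaps"},
            ),
            "Another World (video game)": (set(), {"Video games"}),
        },
    }
    print(disambiguate(d, candidates))
```

Because the pairwise tag agreement is folded into one summed tag vector, each surface form is scored independently, so the cost grows with the total number of candidates rather than with the product of the candidate set sizes.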