Silviu Cucerzan Researcher Microsoft Corporation

[Figure: a browsed document next to a search box with the query "heat shield"; entities from all previous interactions: NASA, Space station, Solar panels, Discovery, Space shuttle, Space lab, John Curry, Atmospheric reentry, Power system, Peter King, CBS News, Associated Press]
[Figure: real-time user intent distribution over the categories movies, news, fan club, biography, gallery, official site, interviews, pictures, wallpaper, posters, imdb, quotes, screensavers, ...]
[Figure: related query variants, e.g. "invention of the bicycle", "invented the bicycle", "invention of the first bicycle", "when was the bicycle invented", "who invented bicycle", "inventor of the airplane", "printing press inventor", "light bulb inventor", ...]
AskMSR → who invented the bicycle?
Web
Encarta Text
Text document
Named entity recognizer
Surface form: Bush (a mention of an entity referred to as “Bush”)
George W. Bush
George H. W. Bush
Bush, music band
Reggie Bush
…
The entity George W. Bush: George W. Bush, George Bush, President Bush, Bush, …
Text document
Named entity recognizer
Bush: George W. Bush; George H. W. Bush; Bush, music band; Reggie Bush; …
Texas: Texas, US state; Texas, pop band; Texas, novel; University of Texas at Austin; …
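The mapping from surface forms to candidate entities shown above can be pictured as a simple lookup table. The following is only a minimal sketch under that assumption, not the system's actual data structure; the names `CANDIDATES`, `candidates_for`, and `ambiguity` are hypothetical, and the entries are limited to the examples listed on the slide.

```python
# Hypothetical lookup table: surface form -> candidate entities.
# Only the examples listed above are included; the real lists are longer.
CANDIDATES = {
    "Bush": [
        "George W. Bush",
        "George H. W. Bush",
        "Bush, music band",
        "Reggie Bush",
    ],
    "Texas": [
        "Texas, US state",
        "Texas, pop band",
        "Texas, novel",
        "University of Texas at Austin",
    ],
}


def candidates_for(surface_form):
    """Return the known candidate entities for a surface form (empty if unknown)."""
    return CANDIDATES.get(surface_form, [])


def ambiguity(mentions):
    """Count the joint entity assignments possible for a list of mentions."""
    total = 1
    for mention in mentions:
        # An unknown mention contributes a factor of 1 (nothing to choose).
        total *= max(len(candidates_for(mention)), 1)
    return total
```

Even with four candidates per surface form, a document mentioning both “Bush” and “Texas” already has 4 × 4 joint assignments, which is why the choice has to be driven by context.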
Challenges:
Reference entities
World knowledge
Page Title
First Paragraph
Surface Form/Disambiguation
E.g.: 385 references of Gwen Stefani in other Wikipedia articles
such as 'Let Me Blow Ya Mind' by Eve and [[Gwen Stefani]] (whom he would produce …
In the video ''[[Cool (song)|Cool]]'', [[Gwen Stefani]] is made-up as Monroe. …
'[[South Side (song)|South Side]]' (featuring [[Gwen Stefani]]) #14 US …
[[1969]] - [[Gwen Stefani]], American singer ([[No Doubt]]) …
[[Rosie Gaines]], [[Carmen Electra]], [[Gwen Stefani]], [[Chuck D]], [[Angie Stone]], …
In late [[2004]], [[Gwen Stefani]] released a hit song called 'Rich Girl' which …
[[Gwen Stefani]] - lead singer of the band [[No Doubt]], who is now a successful …
[[Social Distortion]], and [[TSOL]]. [[Gwen Stefani]], lead vocalist of the [[alternative rock]] ...
main proponents (along with [[Gwen Stefani]] and [[Ashley Judd]]) in bringing back the …
The [[United States|American]] singer [[Gwen Stefani]] references Harajuku in several …
which also features vocals by [[No Doubt]]'s [[Gwen Stefani]]. The cover was included on …
co-written by [[Eric Stefani]] and [[Gwen Stefani]] and co-produced by Matthew Wilder …
…
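References like the ones above can be turned into (surface form, entity) pairs by reading the wiki link markup. The following is a rough sketch of that idea, assuming only the two link shapes visible in the examples, `[[Entity]]` and `[[Entity|surface form]]`; the function names are hypothetical and the regular expression ignores other wiki markup.

```python
import re
from collections import defaultdict

# Matches [[Target]] or [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_references(wikitext):
    """Yield (surface_form, entity_title) for every link in the text."""
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without a pipe, the displayed text is the page title itself.
        anchor = (match.group(2) or target).strip()
        yield anchor, target


def count_surface_forms(lines):
    """Count how often each surface form links to each entity."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        for surface, entity in extract_references(line):
            counts[surface][entity] += 1
    return counts


sample = [
    "In the video ''[[Cool (song)|Cool]]'', [[Gwen Stefani]] is made-up as Monroe.",
    "[[1969]] - [[Gwen Stefani]], American singer ([[No Doubt]])",
]
counts = count_surface_forms(sample)
```

Here the surface form “Cool” is recorded as a reference to the entity “Cool (song)”, and “Gwen Stefani” is counted twice for the entity of the same name.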
Surface Forms
Entities
e.g.: Texas
≈ 30
Texas
Texas (TV Series)
Texas (US State)
University of Texas Austin
USS Texas
Texas (band)
Texas (musical)
Texas (novel)
Texas (SpongeBob episode)
Texas Instruments
Texas County, OK
...
Tags:
NBC network shows
American television soaps
Television spin-offs
Contexts:
Another World
Pam Long
Paul Rauch
...
• the titles of entity pages
• the titles of redirecting pages (e.g., Another World in Texas, http://en.wikipedia.org/wiki/Another_World_in_Texas, for Texas (TV Series))
• the disambiguation pages
• the references to entity pages in other Wikipedia articles
• List pages (“List of [...]”, “Table of [...]”): 540,000 pairs
• Wikipedia categories: 2.65 million pairs
• Lexicosyntactic patterns: noisy
LEX_Scotland_Music_#1
• Appositives and parentheticals in the titles
E.g.: Texas (TV Series)
Texas, Queensland
• Entity references (links)
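As an illustration of the first source, a page title can be split into a surface form and the context carried by its parenthetical or appositive. This is a naive sketch, not the procedure used in the system: `split_title` is a hypothetical helper, and splitting on the first comma is an assumption that will misfire on titles where the comma is part of the name.

```python
import re

# "Name (context)" with the parenthetical at the end of the title.
PAREN = re.compile(r"^(?P<name>.+?)\s*\((?P<context>[^()]+)\)$")


def split_title(title):
    """Split a page title into (surface form, context or None)."""
    match = PAREN.match(title)
    if match:
        return match.group("name"), match.group("context")
    if ", " in title:
        # Appositive form, e.g. "Texas, Queensland".
        name, context = title.split(", ", 1)
        return name, context
    return title, None
```

For the two titles above, this yields the surface form “Texas” with the contexts “TV Series” and “Queensland” respectively.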
Document Analysis
Truecasing
Named Entity Recognition
Coreference Resolution
Disambiguation
Stage 1: Sentence boundary detection and truecasing (sentence beginnings and titles)
Stage 2: Structural ambiguity resolution
Conjunctions (e.g., Barnes and Noble)
Possessives (e.g., Britain’s Tony Blair)
PP attachment (e.g., Whitney Museum in New York)
by using Wikipedia and Web statistics:
T1 Particle T2 → search engine query “T1” “T2”
Stage 3: 5-way named entity classification
Stage 4: Coreference resolution
Shorter to longer forms
e.g., Brown/PERSON → Michael Brown/PERSON
Acronyms
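The “shorter to longer forms” step of Stage 4 can be sketched as follows. This is a simplified heuristic written for illustration, assuming that a short mention corefers with the longest mention of the same type that contains it as a contiguous token sequence; the function names are hypothetical, and acronym handling is left out.

```python
def contains(longer, shorter):
    """True if the token list `shorter` occurs contiguously in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def resolve_short_forms(mentions):
    """Map each (text, label) mention to its longest containing form.

    mentions: list of (text, label) pairs from named entity recognition.
    Returns a dict {(text, label): resolved_text}.
    """
    resolved = {}
    for text, label in mentions:
        tokens = text.split()
        best = text
        for other, other_label in mentions:
            if other_label != label:
                continue  # never merge mentions of different types
            other_tokens = other.split()
            if len(other_tokens) > len(best.split()) and contains(other_tokens, tokens):
                best = other
        resolved[(text, label)] = best
    return resolved


mentions = [
    ("Brown", "PERSON"),
    ("Michael Brown", "PERSON"),
    ("Brown University", "ORGANIZATION"),
]
resolved = resolve_short_forms(mentions)
```

With these mentions, Brown/PERSON resolves to Michael Brown, while the organization mention is left alone because its type differs.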
Web and Corpus Stats
Regular Expressions
Gazetteers
Gazetteers
Web Search
Query Logs
Wikipedia Entities
CoNLL 2003 Statistics
Gazetteers
Heuristics
C = {c1,…,cM} - known contexts (all surface forms and appositives/parentheticals)
T = {t1,…,tN} - known category tags
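Text document $D$ with surface forms $s_1, \dots, s_i, \dots, s_j, \dots, s_n$.
Each surface form $s_i$ has candidate entities $e^{s_i}_1, \dots, e^{s_i}_k, \dots, e^{s_i}_{|\mathcal{E}(s_i)|}$; each candidate $e^{s_i}_k$ has contexts $C^{s_i}_k$ and category tags $T^{s_i}_k$ (likewise $C^{s_j}_l$, $T^{s_j}_l$ for candidate $e^{s_j}_l$ of $s_j$).
Document context: $d = D \cap C$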
Maximize the similarity between the
document context d and each entity’s contexts
as well as the category tags of each entity pair.
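Optimization problem:
$$\arg\max_{(e_1,\dots,e_n) \in \mathcal{E}(s_1)\times\dots\times\mathcal{E}(s_n)} \sum_{i=1}^{n} \langle C_{e_i}, d \rangle + \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \langle T_{e_i}, T_{e_j} \rangle$$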
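More robust and simpler:
$$\bar{d} = d + \sum_{s \in S(D)} \sum_{e \in \mathcal{E}(s)} T_e$$
$$\arg\max_{(e_1,\dots,e_n) \in \mathcal{E}(s_1)\times\dots\times\mathcal{E}(s_n)} \sum_{i=1}^{n} \langle (C_{e_i}, T_{e_i}),\ \bar{d} - (0, T_{e_i}) \rangle$$
$$\arg\max_{e_i \in \mathcal{E}(s_i)} \langle (C_{e_i}, T_{e_i}),\ \bar{d} \rangle - \|T_{e_i}\|^2, \quad i = 1..n$$
where $\|T_{e_i}\|^2$ is the number of category tags of $e_i$.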
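The per-surface-form rule above can be tried out on toy data. The sketch below assumes binary context and tag vectors stored as sets, so that the squared norm of the tag vector is simply the number of tags; the function names are hypothetical. The contexts and tags of “Texas (TV Series)” are the ones listed earlier, while every other entity, context, and tag in the example is made up for illustration.

```python
from collections import Counter


def dot(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(weight * v.get(key, 0) for key, weight in u.items())


def disambiguate(doc_contexts, candidates):
    """Pick one entity per surface form.

    doc_contexts: set of known contexts found in the document (d).
    candidates: {surface_form: {entity: (contexts, tags)}}, both as sets.
    """
    # Extended document vector: d plus the tags of every candidate entity.
    d_bar = Counter({("C", c): 1 for c in doc_contexts})
    for entities in candidates.values():
        for _contexts, tags in entities.values():
            for tag in tags:
                d_bar[("T", tag)] += 1

    result = {}
    for surface, entities in candidates.items():
        def score(entity):
            contexts, tags = entities[entity]
            vec = {("C", c): 1 for c in contexts}
            vec.update({("T", t): 1 for t in tags})
            # Subtract the entity's own tags so they do not vote for it.
            return dot(vec, d_bar) - len(tags)
        result[surface] = max(entities, key=score)
    return result


# Toy input; all entries except those of "Texas (TV Series)" are invented.
candidates = {
    "Texas": {
        "Texas (TV Series)": (
            {"Another World", "Pam Long", "Paul Rauch"},
            {"NBC network shows", "American television soaps", "Television spin-offs"},
        ),
        "Texas (US State)": ({"Austin", "Houston"}, {"Example state tag"}),
    },
    "Another World": {
        "Another World (example TV entity)": (
            {"Texas"},
            {"NBC network shows", "American television soaps"},
        ),
        "Another World (example album entity)": ({"Guitar"}, {"Example album tag"}),
    },
}
result = disambiguate({"Pam Long", "Paul Rauch"}, candidates)
```

In this toy case the TV series wins for “Texas” both because two of its contexts occur in the document and because its tags overlap with those of a candidate for the other surface form.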
Development:
Reference: Wikipedia ver. 04/02/2006
100 news stories from MSNBC
Evaluation:
Reference: Wikipedia ver. 09/11/2006
Test sets:
Wikipedia data: 88.3%
News: 91.4%
© 2007 Microsoft Corporation. All rights reserved.
Microsoft Research
Faculty Summit 2007