slides - AKBC

Open Information Extraction
from the Web
Oren Etzioni
KnowItAll Project (2003…)
Rob Bart
Janara Christensen
Tony Fader
Tom Lin
Alan Ritter
Michael Schmitz
Dr. Niranjan Balasubramanian
Dr. Stephen Soderland
Prof. Mausam
Prof. Dan Weld
PhD alumni: Michele Banko, Prof. Michael Cafarella, Prof. Doug
Downey, Ana-Maria Popescu, Stefan Schoenmackers, and Prof.
Alex Yates
Funding: DARPA, IARPA, NSF, ONR, Google.
Etzioni, University of Washington
Outline
I. A “scruffy” view of Machine Reading
II. Open IE (overview, progress, new demo)
III. Critique of Open IE
IV. Future work: Open, Open IE
I. Machine Reading (Etzioni, AAAI ‘06)
• “MR is an exploratory, open-ended,
serendipitous process”
• “In contrast with many NLP tasks, MR is
inherently unsupervised”
• “Very large scale”
• “Forming Generalizations based on
extracted assertions”
Lessons from DB/KR Research
• Declarative KR is expensive & difficult
• Formal semantics is at odds with
– Broad scope
– Distributed authorship
• KBs are brittle: “can only be used for tasks
whose knowledge needs have been
anticipated in advance” (Halevy IJCAI ‘03)
Machine Reading at Web Scale
• A “universal ontology” is impossible
• Global consistency is like world peace
• Micro ontologies--scale? Interconnections?
• Ontological “glass ceiling”
– Limited vocabulary
– Pre-determined predicates
– Swamped by reading at scale!
II. Open vs. Traditional IE
             Traditional IE                     Open IE
Input:       Corpus + O(R) hand-labeled data    Corpus
Relations:   Specified in advance               Discovered automatically
Extractor:   Relation-specific                  Relation-independent
How is Open IE Possible?
Semantic Tractability Hypothesis
∃ easy-to-understand subset of English
• Characterized relations/arguments syntactically
(Banko, ACL ’08; Fader, EMNLP ’11; Etzioni, IJCAI ‘11)
• Characterization is compact, domain independent
• Covers 85% of binary, verb-based relations
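One way to picture a compact, domain-independent syntactic characterization is as a regular expression over part-of-speech tags, in the spirit of the ReVerb constraint (Fader, EMNLP ’11). The sketch below is an illustrative simplification, not the published implementation: the coarse tag mapping, the single-letter encoding, and the function name are assumptions made for this example.

```python
import re

# Hypothetical coarse tag classes (assumption): V = verb,
# W = noun/adjective/adverb/pronoun/determiner,
# P = preposition/particle/infinitive marker, O = anything else.
# Proper nouns are left unmapped in this toy so argument names stay outside the phrase.
COARSE = {
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
    "NN": "W", "NNS": "W", "JJ": "W", "RB": "W", "PRP": "W", "DT": "W",
    "IN": "P", "RP": "P", "TO": "P",
}

# One or more adjacent matches of V | V P | V W* P.
PATTERN = re.compile(r"(?:V(?:W*P)?)+")

def longest_relation_phrase(tagged):
    """Return the longest token span matching the pattern, or None."""
    codes = "".join(COARSE.get(tag, "O") for _, tag in tagged)
    best = None
    for m in PATTERN.finditer(codes):
        if best is None or m.end() - m.start() > best[1] - best[0]:
            best = (m.start(), m.end())
    if best is None:
        return None
    return " ".join(tok for tok, _ in tagged[best[0]:best[1]])

sentence = [("Einstein", "NNP"), ("was", "VBD"), ("born", "VBN"),
            ("in", "IN"), ("Ulm", "NNP")]
phrase = longest_relation_phrase(sentence)
```

Because each token maps to exactly one character, regex match offsets double as token offsets, which keeps the sketch short.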
SAMPLE RELATION PHRASES
invented
acquired by
has a PhD in
denied
voted for
inhibits tumor growth in
inherited
born in
mastered the art of
downloaded
aspired to
is the patron saint of
expelled
arrived from
wrote the book on
Number of Relations
DARPA MR Domains                 <50
NYU, Yago                        <100
NELL                             ~500
DBpedia 3.2                      940
PropBank                         3,600
VerbNet                          5,000
Wikipedia InfoBoxes, f > 10      ~5,000
TextRunner (phrases)             100,000+
ReVerb (phrases)                 1,000,000+
TextRunner (2007)
First Web-scale Open IE system
Distant supervision + CRF models of relations
(Arg1, Relation phrase, Arg2)
1,000,000,000 distinct extractions
Relation Extraction from Web
Open IE (2012)
• Open source ReVerb extractor
• Synonym detection
• Parser-based Ollie extractor (Mausam EMNLP ‘12)
– Verbs → Nouns and more
– Analyze context (beliefs, counterfactuals)
• Sophistication of IE is a major focus
Examples:
After beating the Heat, the Celtics are now the “top dog” in the NBA.
(the Celtics, beat, the Heat)
If he wins 5 key states, Romney will be president
(counterfactual: “if he wins 5 key states”)
But what about entities, types, ontologies?
Towards “Ontologized” Open IE
• Link arguments to Freebase (Lin, AKBC ‘12)
– When possible!
• Associate types with Args
• No Noun Phrase Left Behind
(Lin, EMNLP ‘12)
System Architecture
Input: Web corpus
Processing:
1. Extractor (relation-independent extraction) → Raw tuples
(XYZ Corp.; acquired; Go Inc.)
(oranges; contain; Vitamin C)
(Einstein; was born in; Ulm)
(XYZ; buyout of; Go Inc.)
(Albert Einstein; born in; Ulm)
(Einstein Bros.; sell; bagels)
2. Assessor (synonyms, confidence) → Extractions
XYZ Corp. = XYZ
Albert Einstein = Einstein != Einstein Bros.
Acquire(XYZ Corp., Go Inc.) [7]
BornIn(Albert Einstein, Ulm) [5]
Sell(Einstein Bros., bagels) [1]
Contain(oranges, Vitamin C) [1]
3. Query processor (index in Lucene; link entities) → Output
DEMO
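The extractor-to-assessor step on this slide can be pictured with a toy sketch: raw tuples are normalized with synonym tables and then counted. The synonym tables here are hand-written for illustration (an assumption; the slide indicates the system detects synonyms itself), and the function name `assess` is hypothetical.

```python
from collections import Counter

# Raw tuples as they might come out of a relation-independent extractor.
raw_tuples = [
    ("XYZ Corp.", "acquired", "Go Inc."),
    ("oranges", "contain", "Vitamin C"),
    ("Einstein", "was born in", "Ulm"),
    ("XYZ", "buyout of", "Go Inc."),
    ("Albert Einstein", "born in", "Ulm"),
    ("Einstein Bros.", "sell", "bagels"),
]

# Hand-written synonym tables (assumption: a real assessor would learn these).
ENTITY_SYNONYMS = {"XYZ": "XYZ Corp.", "Einstein": "Albert Einstein"}
RELATION_SYNONYMS = {
    "acquired": "Acquire", "buyout of": "Acquire",
    "was born in": "BornIn", "born in": "BornIn",
    "sell": "Sell", "contain": "Contain",
}

def assess(tuples):
    """Normalize arguments and relation phrases, then count supporting tuples."""
    counts = Counter()
    for arg1, rel, arg2 in tuples:
        key = (RELATION_SYNONYMS.get(rel, rel),
               ENTITY_SYNONYMS.get(arg1, arg1),
               ENTITY_SYNONYMS.get(arg2, arg2))
        counts[key] += 1
    return counts

extractions = assess(raw_tuples)
```

The counts come from six toy tuples only, so they will not match the bracketed numbers shown on the slide.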
III. Critique of Open IE
• Lack of formal ontology/vocabulary
• Inconsistent extractions
• Can it support reasoning?
• What’s the point of Open IE?
Perspectives on Open IE
A. “Search Needs a Shakeup” (Etzioni, Nature ’11)
B. Textual Resources
C. Reasoning over Extractions
A. New Paradigm for Search
“Moving Up the Information Food Chain”
(Etzioni, AAAI ‘96)
Retrieval → Extraction
Snippets, docs → Entities, Relations
Keyword queries → Questions
List of docs → Answers
Essential for smartphones!
(Siri meets Watson)
Case Study over Yelp Reviews
1. Map review corpus to (attribute, value)
(sushi = fresh) (parking = free)
2. Natural-language queries
“Where’s the best sushi in Seattle?”
3. Sort results via sentiment analysis
exquisite > very good > so, so
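The three steps above can be sketched on toy data: mined (attribute, value) pairs are filtered by the attribute in the query and ordered by a sentiment rank. The restaurant names, the numeric ranks, and the function name `best_for` are assumptions for illustration; this is not the RevMiner implementation.

```python
# Hypothetical sentiment ranks (assumption); higher is better.
SENTIMENT_RANK = {"exquisite": 3, "very good": 2, "so, so": 1}

# Toy (restaurant, attribute, value) triples, as if mined from reviews.
mined = [
    ("Restaurant A", "sushi", "very good"),
    ("Restaurant B", "sushi", "exquisite"),
    ("Restaurant C", "sushi", "so, so"),
    ("Restaurant B", "parking", "free"),
]

def best_for(attribute, triples):
    """Return restaurants mentioning `attribute`, best sentiment first."""
    hits = [(SENTIMENT_RANK.get(value, 0), name)
            for name, attr, value in triples if attr == attribute]
    return [name for _, name in sorted(hits, key=lambda h: -h[0])]

ranking = best_for("sushi", mined)
```

A query such as “Where’s the best sushi in Seattle?” would first have to be mapped to the attribute `sushi`; that step is left out here.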
RevMiner: Extractive Interface to
400K Yelp Reviews (Huang, UIST ’12)
B. Public Textual Resources
(Leveraging Open IE)
• 94M Rel-grams: n-grams, but over relations in text
(Balasubramanian, AKBC ’12)
• 600K Relation phrases (Fader, EMNLP ‘11)
• Relation Meta-data:
– 50K Domain/range for relations (Ritter, ACL ‘10)
– 10K Functional relations (Lin, EMNLP ‘10)
• 30K learned Horn clauses (Schoenmackers, EMNLP ‘10)
• CLEAN (Berant, ACL ‘12)
– 10M entailment rules (coming soon)
– Precision double that of DIRT
See openie.cs.washington.edu
C. Reasoning over Extractions
1,000,000,000 Extractions
• Identify synonyms (Yates & Etzioni JAIR ‘09)
• Linear-time 1st order Horn-clause inference (Schoenmackers EMNLP ’08)
• Learn argument types via generative model (Ritter ACL ‘10)
• Transitive inference (Berant ACL ’11)
Unsupervised, probabilistic model
for identifying synonyms
• P(Bill Clinton = President Clinton)
– Count shared (relation, arg2)
• P(acquired = bought)
– Relations: count shared (arg1, arg2)
• Functions, mutual recursion
• Next step: unify with
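The two counts named on this slide can be sketched directly. The code below only computes the shared-context counts; turning them into probabilities is the job of the probabilistic model and is not shown. The toy tuples and function names are assumptions for illustration.

```python
from collections import defaultdict

def shared_contexts(tuples, a, b):
    """Count (relation, arg2) pairs that strings a and b share as arg1."""
    contexts = defaultdict(set)
    for arg1, rel, arg2 in tuples:
        contexts[arg1].add((rel, arg2))
    return len(contexts[a] & contexts[b])

def shared_argument_pairs(tuples, r1, r2):
    """Count (arg1, arg2) pairs that relation phrases r1 and r2 share."""
    pairs = defaultdict(set)
    for arg1, rel, arg2 in tuples:
        pairs[rel].add((arg1, arg2))
    return len(pairs[r1] & pairs[r2])

# Toy extractions (assumption).
toy = [
    ("Bill Clinton", "was born in", "Hope"),
    ("President Clinton", "was born in", "Hope"),
    ("Bill Clinton", "vetoed", "the bill"),
    ("President Clinton", "vetoed", "the bill"),
    ("XYZ Corp.", "acquired", "Go Inc."),
    ("XYZ Corp.", "bought", "Go Inc."),
]

entity_evidence = shared_contexts(toy, "Bill Clinton", "President Clinton")
relation_evidence = shared_argument_pairs(toy, "acquired", "bought")
```

More shared contexts suggest, but do not prove, that two strings are synonyms.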
Scalable Textual Inference
Desiderata for inference:
• In text → probabilistic inference
• On the Web → linear in |Corpus|
Argument distributions of textual relations:
• Inference provably linear
• Empirically linear!
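To make the kind of inference concrete, here is a minimal sketch of applying one first-order Horn clause to extracted tuples. It is not the Holmes system and does not reproduce its probabilistic scoring or its linearity guarantee; the rule, the `located_in` facts, and the function name are assumptions for illustration.

```python
from collections import defaultdict

def apply_rule(facts, body1, body2, head):
    """Apply head(X, Z) :- body1(X, Y), body2(Y, Z) once over a set of facts."""
    by_arg1 = defaultdict(list)
    for rel, a, b in facts:
        if rel == body2:
            by_arg1[a].append(b)          # index body2 facts on their first argument
    derived = set()
    for rel, x, y in facts:
        if rel == body1:
            for z in by_arg1.get(y, []):  # join on the shared variable Y
                derived.add((head, x, z))
    return derived - set(facts)           # keep only facts that are new

facts = {
    ("born_in", "Einstein", "Ulm"),
    ("located_in", "Ulm", "Germany"),
    ("born_in", "Sergey Brin", "Moscow"),
}
new_facts = apply_rule(facts, "born_in", "located_in", "born_in")
```

Indexing one body literal on the join variable avoids comparing every pair of facts.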
Inference Scalability for Holmes
Extractions → Domain/range
• Much previous work (Resnik, Pantel, etc.)
• Utilize generative topic models
Extractions of R → Document
Domain/range of R → topics
Relations as Documents
TextRunner Extractions:
born_in(Sergey Brin, Moscow)
headquartered_in(Microsoft, Redmond)
born_in(Bill Gates, Seattle)
born_in(Einstein, March)
founded_in(Google, 1998)
headquartered_in(Google, Mountain View)
born_in(Sergey Brin, 1973)
founded_in(Microsoft, Albuquerque)
born_in(Einstein, Ulm)
founded_in(Microsoft, 1973)
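Treating each relation as a document amounts to grouping extractions by relation, so that the argument pairs play the role of a document's words. A minimal sketch, using a few of the extractions from this slide (the function name is an assumption):

```python
from collections import defaultdict

extractions = [
    ("born_in", "Sergey Brin", "Moscow"),
    ("headquartered_in", "Microsoft", "Redmond"),
    ("born_in", "Bill Gates", "Seattle"),
    ("founded_in", "Google", "1998"),
    ("born_in", "Einstein", "Ulm"),
]

def relations_as_documents(triples):
    """Group argument pairs by relation: one 'document' per relation."""
    docs = defaultdict(list)
    for rel, a1, a2 in triples:
        docs[rel].append((a1, a2))
    return dict(docs)

docs = relations_as_documents(extractions)
```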
Generative Story
[LinkLDA, Erosheva et al. 2004]
1. For each relation, randomly pick a distribution over types
X born_in Y
P(Topic1|born_in)=0.5
P(Topic2|born_in)=0.3
…
2. For each extraction, pick a type for a1 and a type for a2
Person born_in Location
3. Then pick arguments based on types
Sergey Brin born_in Moscow
Two separate sets of type distributions
[Plate diagram: type variables z1, z2 and arguments a1, a2, with plates N, R, and T]
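The generative story can be sketched as a sampler. This is a toy forward simulation under stated assumptions, not the model's inference procedure: the symmetric Dirichlet prior, the uniform choice of an argument within a type (standing in for a learned per-type distribution), the toy vocabularies, and all names are illustrative.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional symmetric Dirichlet sample via normalized gammas."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate(relation_names, n_per_relation, arg1_by_type, arg2_by_type, seed=0):
    """Sample (arg1, relation, arg2) tuples following the three-step story."""
    rng = random.Random(seed)
    k = len(arg1_by_type)
    out = []
    for rel in relation_names:
        # Step 1: per-relation distribution over types.
        theta = sample_dirichlet(1.0, k, rng)
        for _ in range(n_per_relation):
            # Step 2: a type for each argument, both drawn from theta.
            z1 = rng.choices(range(k), weights=theta)[0]
            z2 = rng.choices(range(k), weights=theta)[0]
            # Step 3: arguments given types; arg1 and arg2 use separate tables.
            a1 = rng.choice(arg1_by_type[z1])
            a2 = rng.choice(arg2_by_type[z2])
            out.append((a1, rel, a2))
    return out

# Toy vocabularies (assumption): two types, separate tables for arg1 and arg2.
ARG1 = [["Sergey Brin", "Einstein"], ["Google", "Microsoft"]]
ARG2 = [["Moscow", "Ulm"], ["1998", "1973"]]
samples = generate(["born_in", "founded_in"], 3, ARG1, ARG2)
```

Keeping `ARG1` and `ARG2` as separate tables mirrors the "two separate sets" of distributions noted on the slide.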
Examples of Learned Domain/range
• elect(Country, Person)
• predict(Expert, Event)
• download(People, Software)
• invest(People, Assets)
• was-born-in(Person, Location OR Date)
Summary: Trajectory of Open IE
2003: KnowItAll project
2007: TextRunner: 1,000,000,000 “Ontology free” extractions
2008-9: Inference over extractions
2010-11: Open source extractor; Public textual Resources
2012: Freebase types; IE-based search; Deeper analysis of sentences
openie.cs.washington.edu
IV. Future: Open Open IE
• Open input: ingest tuples from any source
(Tuple, Source, Confidence)
• Linked Open Output:
– Extractions → Linked-open Data (LOD) cloud
– Relation normalization
– Use LOD best practices
• Specialized reasoners
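The "open input" record (Tuple, Source, Confidence) could be represented as below. This is a hypothetical sketch: the field names, the dict-based input format, and the assumption that confidence lies in [0, 1] are not specified on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourcedTuple:
    arg1: str
    relation: str
    arg2: str
    source: str        # where the tuple came from (hypothetical free-text label)
    confidence: float  # assumed to lie in [0, 1]

def ingest(records):
    """Validate and wrap raw dict records from any producer."""
    out = []
    for r in records:
        conf = float(r["confidence"])
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        out.append(SourcedTuple(r["arg1"], r["relation"], r["arg2"],
                                r["source"], conf))
    return out

ingested = ingest([
    {"arg1": "oranges", "relation": "contain", "arg2": "Vitamin C",
     "source": "toy-extractor", "confidence": "0.9"},
])
```

Keeping the source and confidence next to each tuple lets downstream reasoners weigh evidence from different producers.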
Conclusions
1. Ontology is not necessary for reasoning
2. Open IE is “gracefully” ontologized
3. Open IE is boosting text analysis
4. LOD has distribution & scale (but not text) = opportunity
Qs
• Why Open?
• What’s next?
• Dimensions for analyzing systems
• What’s worked, what’s failed? (lessons)
• What can we learn from Watson?
• What can we learn from DB/KR? (Alon)
Questions
• What extraction mechanism is used?
• What corpus?
• What input knowledge?
• Role for people/manual labeling
• Form of the extracted knowledge?
• Size/scope of extracted knowledge?
• What reasoning is done?
• Most unique aspect?
• Biggest challenge?
Scalability notes
• Interoperability, distributed authorship, vs. a
monolithic system
• Open IE meets RDF:
– Need URIs for predicates. How to obtain?
– What about errors in mapping to URI?
– Ambiguity? Uncertainty?
Reasoning
• NELL: inter-class constraints to generate negative examples
Dimensions of scalability
• Corpus size
• Syntactic coverage over text
• Semantic coverage over text
– Time, belief, n-ary relations, etc.
• Number of entities, relations
• Ability to reason
• How much CPU?
• How much manual effort?
• Bounding, ceiling effect, ontological glass ceiling
Example of limiting assumptions
• NELL: “apple” has a single meaning
• Single atom per entity
– Global computation to add entity
– Can’t be sure
• LOD:
– Best practice
– Same-as links
Risk for scalable system
• Limited semantics, reasoning
• No reasoning…
LOD triples in Aug 2011:
31,634,213,770
• The following statement appears in the last
paragraph of W3C Linked Library Data Group
Final Report:
• . . . Linked Data follows an open-world
assumption: the assumption that data cannot
generally be assumed to be complete and
that, in principle, more data may become
available for any given entity.
Entity Linking an Extraction Corpus
Einstein quit his job at the patent office (8)
Candidate              1. String Match   2. Prominence Priors   3. Context Match   Link Score
US Patent Office       (med)             1,281 inlinks          (low)              (med)
EU Patent Office       (med)             168 inlinks            (low)              (low)
Japan Patent Office    (med)             56 inlinks             (low)              (low)
Swiss Patent Office    (med)             101 inlinks            (very high)        (high)
Patent                 (low)             4,620 inlinks          (low)              (low)
1. String Match: obtain candidates, and measure string similarity.
Exact String Match = best match; also consider:
– Known Aliases
– Alternate capitalization
– Substring/Superstring
– Potential Abbreviations
– Edit distance
– Word overlap
2. Prominence Priors: Prominence ∝ # of links in Wikipedia to that Entity’s article.
3. Context Match: cosine similarity between the “Document” of the extraction’s source sentences and Wikipedia Article Texts.
Source sentences:
“Einstein quit his job at the patent office.”
“Einstein quit his job at the patent office to become a professor.”
“In 1909, Einstein quit his job at the patent office.”
“Einstein quit his job at the patent office where he worked.”
Link Score is a function of (String Match Score, Prominence Prior Score, Context Match Score)
e.g., String Match Score x ln(Prominence Prior Score) x Context Match Score
Link Ambiguity: Top Link Score, 2nd Top Link Score
Collective Linking vs One Extraction at a time
– Faster
– Higher Precision
2.53GHz computer links 15 million text arguments in 2~3 days (60+ per second)
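The example formula from this slide, String Match Score x ln(Prominence Prior Score) x Context Match Score, can be worked through on the five candidates. The inlink counts and qualitative levels are taken from the slide; the numeric values assigned to "low", "med", and "very high" are assumptions chosen only to make the arithmetic concrete.

```python
import math

# Numeric stand-ins for the slide's qualitative levels (assumption).
LEVEL = {"low": 0.2, "med": 0.5, "very high": 1.0}

# (candidate, string match, inlinks, context match), as listed on the slide.
candidates = [
    ("US Patent Office", "med", 1281, "low"),
    ("EU Patent Office", "med", 168, "low"),
    ("Japan Patent Office", "med", 56, "low"),
    ("Swiss Patent Office", "med", 101, "very high"),
    ("Patent", "low", 4620, "low"),
]

def link_score(string_match, inlinks, context_match):
    """String Match Score x ln(Prominence Prior Score) x Context Match Score."""
    return LEVEL[string_match] * math.log(inlinks) * LEVEL[context_match]

ranked = sorted(candidates, key=lambda c: link_score(*c[1:]), reverse=True)
best = ranked[0][0]
```

With these stand-in values the context match outweighs raw prominence, which is why the most-linked candidate does not come out on top.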
Q/A with Linked Extractions
• Ambiguous Entities
• Typed Search
• Linked Resources
Leverages KBs by linking textual arguments
to entities found in the knowledge base.
Example (ambiguous entities): “I need to learn about Titanic the ship for my homework.”
“Titanic earned more than $1 billion worldwide”
“The Titanic set sail from Southampton”
“The Titanic sank in 1912”
“RMS Titanic weighed about 26 kt”
“The Titanic was released in 1998”
“Titanic represents the state-of-the-art in special effects”
“The Titanic was built for safety and comfort”
“The Titanic sank in 12,460 feet of water”
“Titanic was built in Belfast”
Example (typed search): “Which sports originated in China?”
“Golf originated in China”
“Noodles originated in China”
“Soccer originated in China”
“Printmaking originated in China”
“Karate originated in China”
“Soy Beans originated in China”
“Wushu originated in China”
“Taoism originated in China”
“Dragon Boating originated in China”
“Ping Pong originated in China”
Sports that originated in China: Golf, Soccer, Karate, Wushu, Dragon Boating, Ping Pong, …
Additional matches shown on the slide: (534 more …), (14 more …), (3,761 more …), (1,902 more …)
Example (linked resources): Freebase Sports
Dragon Boating → “Dragon Boat Racing”
Ping Pong → “Table Tennis”
…
Linked Extractions support Reasoning
In addition to Question Answering, Linking can also benefit:
Functions [Ritter et al., 2008; Lin et al., 2010]
Other Relation Properties [Popescu 2007; Lin et al., CSK 2010]
Inference [Schoenmackers et al., 2008; Berant et al., 2011]
Knowledge-Base Population [Dredze et al., 2010]
Concept-Level Annotations [Christensen and Pasca, 2012]
… basically anything using the output of extraction
Other Web-based text containing Entities (e.g., Query Logs)
can also be linked to enable new experiences…
Challenges
• Single-sentence extraction
– He believed the plan will work
– John Glenn was the first American in space
– Obama was elected President in 2008.
– American president Barack Obama asserted…
• ??