Open Information Extraction from the Web
Oren Etzioni
KnowItAll Project (2003…)

Rob Bart, Janara Christensen, Tony Fader, Tom Lin, Alan Ritter, Michael Schmitz, Dr. Niranjan Balasubramanian, Dr. Stephen Soderland, Prof. Mausam, Prof. Dan Weld
PhD alumni: Michele Banko, Prof. Michael Cafarella, Prof. Doug Downey, Ana-Maria Popescu, Stefan Schoenmackers, and Prof. Alex Yates
Funding: DARPA, IARPA, NSF, ONR, Google.

Outline
I. A "scruffy" view of Machine Reading
II. Open IE (overview, progress, new demo)
III. Critique of Open IE
IV. Future work: Open, Open IE

I. Machine Reading (Etzioni, AAAI '06)
• "MR is an exploratory, open-ended, serendipitous process"
• "In contrast with many NLP tasks, MR is inherently unsupervised"
• "Very large scale"
• "Forming generalizations based on extracted assertions"

Lessons from DB/KR Research
• Declarative KR is expensive & difficult
• Formal semantics is at odds with
  – Broad scope
  – Distributed authorship
• KBs are brittle: "can only be used for tasks whose knowledge needs have been anticipated in advance" (Halevy, IJCAI '03)

Machine Reading at Web Scale
• A "universal ontology" is impossible
• Global consistency is like world peace
• Micro-ontologies: scale? interconnections?
• Ontological "glass ceiling"
  – Limited vocabulary
  – Pre-determined predicates
  – Swamped by reading at scale!

II. Open vs. Traditional IE

            Traditional IE                   Open IE
Input:      Corpus + O(R) hand-labeled data  Corpus
Relations:  Specified in advance             Discovered automatically
Extractor:  Relation-specific                Relation-independent

How is Open IE possible?

Semantic Tractability Hypothesis
∃ an easy-to-understand subset of English
• Relations/arguments characterized syntactically (Banko, ACL '08; Fader, EMNLP '11; Etzioni, IJCAI '11)
• Characterization is compact, domain-independent
• Covers 85% of binary, verb-based relations

SAMPLE OF EXTRACTED RELATION PHRASES
invented / acquired by / has a PhD in / denied / voted for / inhibits tumor growth in / inherited / born in / mastered the art of / downloaded / aspired to / is the patron saint of / expelled / arrived from / wrote the book on

Number of Relations

DARPA MR Domains             <50
NYU, Yago                    <100
NELL                         ~500
DBpedia 3.2                  940
PropBank                     3,600
VerbNet                      5,000
Wikipedia InfoBoxes, f > 10  ~5,000
TextRunner (phrases)         100,000+
ReVerb (phrases)             1,000,000+

TextRunner (2007)
• First Web-scale Open IE system
• Distant supervision + CRF models of relations
• (Arg1, Relation phrase, Arg2)
• 1,000,000,000 distinct extractions

Relation Extraction from the Web

Open IE (2012)
"After beating the Heat, the Celtics are now the 'top dog' in the NBA." → (the Celtics, beat, the Heat)
"If he wins 5 key states, Romney will be president" → (counterfactual: "if he wins 5 key states")
• Open-source ReVerb extractor
• Synonym detection
• Parser-based Ollie extractor (Mausam, EMNLP '12)
  – Verbs, nouns, and more
  – Analyzes context (beliefs, counterfactuals)
• Sophistication of IE is a major focus
But what about entities, types, ontologies?
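The syntactic characterization above is concrete enough to sketch in code. Below is a minimal Python version of a ReVerb-style relation-phrase matcher over pre-tagged text: a relation phrase is a maximal token span whose POS tags match the pattern V | V P | V W* P (Fader, EMNLP '11). The tag-class mapping and the single greedy regex are simplifications of mine for illustration, not the released ReVerb extractor.

```python
import re

def tag_class(tag: str) -> str:
    """Map a Penn Treebank POS tag onto the pattern's tag classes."""
    if tag.startswith("VB"):
        return "V"  # verb
    if tag in ("RP", "TO", "IN"):
        return "P"  # particle / infinitive marker / preposition
    if tag.startswith(("NN", "JJ", "RB", "PRP", "DT")):
        return "W"  # noun / adjective / adverb / pronoun / determiner
    return "O"      # anything else ends the phrase

def relation_phrases(tagged):
    """Return maximal spans matching V | V P | V W* P over (token, tag) pairs."""
    classes = "".join(tag_class(tag) for _, tag in tagged)
    return [
        " ".join(tok for tok, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"V+(?:W*P+)?", classes)  # greedy = longest match
    ]

# "was born in" and "has a PhD in" both fit the pattern:
print(relation_phrases([("Einstein", "NNP"), ("was", "VBD"),
                        ("born", "VBN"), ("in", "IN"), ("Ulm", "NNP")]))
print(relation_phrases([("She", "PRP"), ("has", "VBZ"), ("a", "DT"),
                        ("PhD", "NN"), ("in", "IN"), ("physics", "NN")]))
```

Taking the longest match is what lets the pattern capture multi-word phrases like "is the patron saint of" rather than stopping at the bare verb.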
Towards "Ontologized" Open IE
• Link arguments to Freebase (Lin, AKBC '12)
  – When possible!
• Associate types with arguments
• No Noun Phrase Left Behind (Lin, EMNLP '12)

System Architecture
Input: Web corpus
Processing: Extractor (relation-independent extraction) → raw tuples → Assessor (synonyms, confidence) → extractions → Query processor (index in Lucene; link entities)

Raw tuples:
(XYZ Corp.; acquired; Go Inc.)   (XYZ; buyout of; Go Inc.)
(oranges; contain; Vitamin C)    (Albert Einstein; born in; Ulm)
(Einstein; was born in; Ulm)     (Einstein Bros.; sell; bagels)

Assessor: XYZ Corp. = XYZ; Albert Einstein = Einstein ≠ Einstein Bros.

Extractions:
Acquire(XYZ Corp., Go Inc.)   [7]
BornIn(Albert Einstein, Ulm)  [5]
Sell(Einstein Bros., bagels)  [1]
Contain(oranges, Vitamin C)   [1]

DEMO

III. Critique of Open IE
• Lack of formal ontology/vocabulary
• Inconsistent extractions
• Can it support reasoning?
• What's the point of Open IE?

Perspectives on Open IE
A. "Search Needs a Shake-Up" (Etzioni, Nature '11)
B. Textual resources
C. Reasoning over extractions

A. New Paradigm for Search
"Moving Up the Information Food Chain" (Etzioni, AAAI '96)

Retrieval         →  Extraction
Snippets, docs    →  Entities, relations
Keyword queries   →  Questions
List of docs      →  Answers

Essential for smartphones! (Siri meets Watson)

Case Study over Yelp Reviews
1. Map review corpus to (attribute, value) pairs: (sushi = fresh), (parking = free)
2. Natural-language queries: "Where's the best sushi in Seattle?"
3. Sort results via sentiment analysis: exquisite > very good > so-so

RevMiner: Extractive Interface to 400K Yelp Reviews (Huang, UIST '12)

B. Public Textual Resources (Leveraging Open IE)
• 94M Rel-grams: n-grams, but over relations in text (Balasubramanian, AKBC '12)
• 600K relation phrases (Fader, EMNLP '11)
• Relation meta-data:
  – 50K domain/range assignments for relations (Ritter, ACL '10)
  – 10K functional relations (Lin, EMNLP '10)
• 30K learned Horn clauses (Schoenmackers, EMNLP '10)
• CLEAN (Berant, ACL '12)
  – 10M entailment rules (coming soon)
  – Precision double that of DIRT
See openie.cs.washington.edu

C. Reasoning over Extractions
1,000,000,000 extractions, supporting:
• Identify synonyms (Yates & Etzioni, JAIR '09)
• Linear-time 1st-order Horn-clause inference (Schoenmackers, EMNLP '08)
• Learn argument types via a generative model (Ritter, ACL '10)
• Transitive inference (Berant, ACL '11)

Unsupervised, Probabilistic Model for Identifying Synonyms
• P(Bill Clinton = President Clinton)
  – Count shared (relation, arg2) pairs
• P(acquired = bought)
  – For relations: count shared (arg1, arg2) pairs
• Functions, mutual recursion
• Next step: unify with …
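The slide gives only the counting intuition, so here is a minimal sketch of it in Python: score two argument strings as potential synonyms by the overlap of the (relation, arg2) pairs they occur with. The Jaccard measure and the toy triples are illustrative assumptions of mine; the actual model (Yates & Etzioni, JAIR '09) is probabilistic and additionally exploits functions and the mutual recursion between entity and relation synonymy.

```python
from collections import defaultdict

def property_sets(triples):
    """Index each arg1 by the set of (relation, arg2) pairs it appears with."""
    props = defaultdict(set)
    for arg1, rel, arg2 in triples:
        props[arg1].add((rel, arg2))
    return props

def synonymy_score(props, a, b):
    """Jaccard overlap of shared properties: a crude stand-in for
    P(a = b) estimated from shared (relation, arg2) counts."""
    union = props[a] | props[b]
    return len(props[a] & props[b]) / len(union) if union else 0.0

triples = [
    ("Bill Clinton", "born in", "Hope"),
    ("Bill Clinton", "married", "Hillary"),
    ("President Clinton", "born in", "Hope"),
    ("President Clinton", "married", "Hillary"),
    ("Einstein Bros.", "sell", "bagels"),
]
props = property_sets(triples)
print(synonymy_score(props, "Bill Clinton", "President Clinton"))  # 1.0
print(synonymy_score(props, "Bill Clinton", "Einstein Bros."))     # 0.0
```

The same machinery applied to relation phrases, counting shared (arg1, arg2) pairs instead, scores P(acquired = bought).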
Scalable Textual Inference
Desiderata for inference:
• In text ⇒ probabilistic inference
• On the Web ⇒ linear in |Corpus|
Argument distributions of textual relations:
• Inference provably linear
• Empirically linear!

Inference Scalability for Holmes

Domain/Range from Extractions
• Much previous work (Resnik, Pantel, etc.)
• Utilize generative topic models:
  – Extractions of a relation R ↔ document
  – Domain/range of R ↔ topics

TextRunner Extractions as Documents
born_in(Sergey Brin, Moscow)
headquartered_in(Microsoft, Redmond)
born_in(Bill Gates, Seattle)
born_in(Einstein, March)
founded_in(Google, 1998)
headquartered_in(Google, Mountain View)
born_in(Sergey Brin, 1973)
founded_in(Microsoft, Albuquerque)
born_in(Einstein, Ulm)
founded_in(Microsoft, 1973)

A Generative Story [LinkLDA, Erosheva et al., 2004]
Example: X born_in Y, with P(Topic1 | born_in) = 0.5, P(Topic2 | born_in) = 0.3, …
1. For each relation, randomly pick a distribution over types (e.g., Person born_in Location); there are two separate sets of type distributions, one per argument position.
2. For each extraction, pick a topic (type) z1 for arg1 and z2 for arg2 (e.g., Sergey Brin born_in Moscow).
3. Then pick the arguments a1, a2 based on their types.
(Plate notation: z1, z2, a1, a2, over N extractions and R relations, with type distributions θ1, θ2.)
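To make the generative story concrete, here is a toy forward-sampling sketch in Python. The type inventory and all probabilities are invented for illustration; the real model learns them from roughly a billion extractions by inverting this process with posterior inference. Only the sampling structure follows the slide: each relation has a type distribution per argument slot, each extraction draws types z1, z2, and the arguments a1, a2 are drawn given those types.

```python
import random

# Toy type inventory (an assumption for illustration only).
TYPES = {
    "Person":   ["Sergey Brin", "Bill Gates", "Einstein"],
    "Location": ["Moscow", "Seattle", "Ulm"],
    "Date":     ["1973", "1998", "March"],
}

# Two separate sets of per-relation type distributions, one per argument slot.
ARG1_TYPES = {"born_in": {"Person": 1.0}}
ARG2_TYPES = {"born_in": {"Location": 0.6, "Date": 0.4}}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(relation):
    """Pick a type for each argument slot, then pick arguments given the types."""
    z1 = sample(ARG1_TYPES[relation])  # type for arg1
    z2 = sample(ARG2_TYPES[relation])  # type for arg2
    a1 = random.choice(TYPES[z1])      # argument drawn given its type
    a2 = random.choice(TYPES[z2])
    return (a1, relation, a2)

random.seed(0)
for _ in range(3):
    print(generate("born_in"))  # e.g., ('Einstein', 'born_in', 'Ulm')
```

Fitting the model reverses this story: given observed extractions such as born_in(Einstein, Ulm) and born_in(Sergey Brin, 1973), inference recovers that born_in's arg2 slot mixes Location and Date types, exactly the kind of domain/range shown next.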
Examples of Learned Domain/Range
• elect(Country, Person)
• predict(Expert, Event)
• download(People, Software)
• invest(People, Assets)
• was_born_in(Person, Location OR Date)

Summary: Trajectory of Open IE
2003: KnowItAll project
2007: TextRunner: 1,000,000,000 "ontology-free" extractions
2008-9: Inference over extractions
2010-11: Open-source extractor; public textual resources (openie.cs.washington.edu)
2012: Freebase types; deeper analysis of sentences; public IE-based search

IV. Future: Open, Open IE
• Open input: ingest tuples from any source as (Tuple, Source, Confidence)
• Linked open output:
  – Extractions → Linked Open Data (LOD) cloud
  – Relation normalization
  – Use LOD best practices
• Specialized reasoners

Conclusions
1. Ontology is not necessary for reasoning
2. Open IE is "gracefully" ontologized
3. Open IE is boosting text analysis
4. LOD has distribution & scale (but not text) = opportunity

Qs
• Why Open? What's next?
• Dimensions for analyzing systems
• What's worked, what's failed? (lessons)
• What can we learn from Watson?
• What can we learn from DB/KR? (Alon)

Questions
• What extraction mechanism is used?
• What corpus? What input knowledge?
• Role for people/manual labeling?
• Form of the extracted knowledge?
• Size/scope of the extracted knowledge?
• What reasoning is done?
• Most unique aspect? Biggest challenge?

Scalability Notes
• Interoperability, distributed authorship vs. a monolithic system
• Open IE meets RDF:
  – Need URIs for predicates. How to obtain them?
  – What about errors in mapping to URIs?
  – Ambiguity? Uncertainty?

Reasoning
• NELL: inter-class constraints to generate negative examples

Dimensions of Scalability
• Corpus size
• Syntactic coverage over text
• Semantic coverage over text
  – Time, belief, n-ary relations, etc.
• Number of entities, relations
• Ability to reason
• How much CPU? How much manual effort?
• Bounding, ceiling effect, ontological glass ceiling

Examples of Limiting Assumptions
• NELL: "apple" has a single meaning
• Single atom per entity
  – Global computation to add an entity
  – Can't be sure
• LOD:
  – Best practice: same-as links

Risk for a Scalable System
• Limited semantics, reasoning
• No reasoning…

LOD triples in Aug. 2011: 31,634,213,770

The following statement appears in the last paragraph of the W3C Library Linked Data Incubator Group Final Report:
"… Linked Data follows an open-world assumption: the assumption that data cannot generally be assumed to be complete and that, in principle, more data may become available for any given entity."

Entity Linking an Extraction Corpus
Extraction: "Einstein quit his job at the patent office" (8)
Candidate entities for "patent office" are scored on three signals:
1. String match: obtain candidates and measure string similarity to the argument. An exact string match is the best match; also considered: known aliases, substring/superstring, word overlap, alternate capitalization, abbreviations, edit distance.
2. Prominence prior: ∝ # of links in Wikipedia to the entity's Wikipedia article.
3. Context match: cosine similarity between the "document" formed from the extraction's source sentences ("Einstein quit his job at the patent office to become a professor.", "In 1909, Einstein quit his job at the patent office.", "Einstein quit his job at the patent office where he worked.") and the text of the entity's Wikipedia article.

Candidate             String match  Inlinks  Context match  Link score
US Patent Office      med           1,281    low            med
EU Patent Office      med           168      low            low
Japan Patent Office   med           56       low            low
Swiss Patent Office   med           101      very high      high
Patent                low           4,620    low            low

Link score is a function of (string match score, prominence prior score, context match score), e.g., string match score × ln(prominence prior score) × context match score. The top-scoring candidate becomes the link, and ambiguity is flagged by comparing the top link score against the second. Collective linking yields higher precision than linking one extraction at a time. 15 million extractions were linked in 2-3 days on a 2.53GHz computer (60+ per second).
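The slide names one concrete instance of the link-score function: string match score × ln(prominence prior) × context match score. A minimal sketch of that combination follows; the candidate numbers are toy assumptions of mine, not the system's actual scores.

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    string_match: float   # similarity of mention to candidate name/aliases
    inlinks: int          # prominence prior: # of Wikipedia inlinks
    context_match: float  # cosine(source-sentence "document", Wikipedia article)

def link_score(c: Candidate) -> float:
    """string match x ln(prominence prior) x context match."""
    return c.string_match * math.log(c.inlinks) * c.context_match

# Toy scores for the "patent office" mention above.
candidates = [
    Candidate("US Patent Office",    0.7, 1281, 0.10),
    Candidate("EU Patent Office",    0.7,  168, 0.05),
    Candidate("Japan Patent Office", 0.7,   56, 0.05),
    Candidate("Swiss Patent Office", 0.7,  101, 0.60),
    Candidate("Patent",              0.4, 4620, 0.05),
]
ranked = sorted(candidates, key=link_score, reverse=True)
print(ranked[0].name)                                 # Swiss Patent Office
print(link_score(ranked[0]) / link_score(ranked[1]))  # top-vs-second margin
```

Note how the log damps the prominence prior: "Patent" has by far the most inlinks, yet the context match lets the less prominent Swiss Patent Office win, and the top-vs-second margin gives the ambiguity signal mentioned above.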
Q/A with Linked Extractions
• Ambiguous entities: extractions about "Titanic" mix the ship and the movie.
  – Ship: "The Titanic set sail from Southampton", "The Titanic sank in 1912", "The Titanic was built for safety and comfort", "The Titanic sank in 12,460 feet of water", "Titanic was built in Belfast", "RMS Titanic weighed about 26 kt", …
  – Movie: "Titanic earned more than $1 billion worldwide", "The Titanic was released in 1998", "Titanic represents the state-of-the-art in special effects", …
  – "I need to learn about Titanic the ship for my homework." Linking resolves which entity each extraction is about.
• Typed search: "Which sports originated in China?" Extractions include "Golf originated in China", "Noodles originated in China", "Soccer originated in China", "Printmaking originated in China", "Karate originated in China", "Soy beans originated in China", "Wushu originated in China", "Taoism originated in China", "Dragon Boating originated in China", "Ping Pong originated in China", …; type information filters the arguments to sports: Golf, Soccer, Wushu, Karate, Dragon Boating, Ping Pong.
• Linked resources: Freebase Sports entities such as "Dragon Boat Racing" and "Table Tennis", …
Leverages KBs by linking textual arguments to entities found in the knowledge base.

Linked Extractions Support Reasoning
In addition to question answering, linking can also benefit:
• Functions [Ritter et al., 2008; Lin et al., 2010]
• Other relation properties [Popescu, 2007; Lin et al., CSK 2010]
• Inference [Schoenmackers et al., 2008; Berant et al., 2011]
• Knowledge-base population [Dredze et al., 2010]
• Concept-level annotations [Christensen and Pasca, 2012]
• … basically anything using the output of extraction
Other Web-based text containing entities (e.g., query logs) can also be linked to enable new experiences…

Challenges
• Single-sentence extraction:
  – "He believed the plan will work"
  – "John Glenn was the first American in space"
  – "Obama was elected President in 2008."
  – "American president Barack Obama asserted…"
• ??