Information Extraction

Can we automatically extract this information from the text (instead of depending on creators to provide automated annotations)?

What is “Information Extraction”

As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE output:

NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..

Slides from Cohen & McCallum

Tapping into the Collective Unconscious

How can we possibly do this without full NLP?

• Another thread of exciting research is driven by the realization that the Web is not random at all!
– It is written by humans
– …so analyzing its structure and content allows us to tap into the collective unconscious
• Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness”
• Examples:
– Analyzing term co-occurrences in web-scale corpora to capture semantic information (today’s paper)
– Analyzing the link structure of the web graph to discover communities
  • DoD and NSA are very much into this as a way of breaking terrorist cells
– Analyzing the transaction patterns of customers (collaborative filtering)

“(Un)wrapping the wrapped results..”

Fielded IE Systems: Citeseer, Google Scholar; Libra
How do they do it? Why do they fail?

4/30

IE in Context

[Figure: pipeline diagram]
• Main flow: Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Query, Search; Data mine
• Supporting steps: Create ontology; Label training data; Train extraction models
• Data stores: Document collection; Database

What is “Information Extraction”

As a family of techniques: Information Extraction = segmentation + classification + association + clustering

Segments found in the article above, in order of appearance:
• Microsoft Corporation
• CEO
• Bill Gates
• Microsoft
• Gates
• Microsoft
• Bill Veghte
• Microsoft
• VP
• Richard Stallman
• founder
• Free Software Foundation

Information Extraction vs. NLP?

• Information extraction attempts to find some of the structure and meaning in web pages that are, hopefully, template-driven.
• As IE becomes more ambitious and text becomes more free-form, IE ultimately becomes equivalent to NLP.
• The Web does give NLP one particular boost:
– Massive corpora

MUC

• DARPA funded significant efforts in IE in the early to mid 1990s.
• The Message Understanding Conference (MUC) was an annual event/competition where results were presented.
• Focused on extracting information from news articles:
– Terrorist events
– Industrial joint ventures
– Company management changes
• Information extraction is of particular interest to the intelligence community (CIA, NSA).

What makes IE from the Web Different?

Less grammar, but more formatting & linking

Newswire:

Apple to Open Its First Retail Store in New York City

MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.

"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

Web:

• www.apple.com/retail
• www.apple.com/retail/soho
• www.apple.com/retail/soho/theatre.html

The directory structure, link structure, formatting & layout of the Web is its own new grammar.

Landscape of IE Tasks (1/4): Pattern Feature Domain

• Text paragraphs without formatting
– Example: Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
• Grammatical sentences and some formatting & links
• Non-grammatical snippets, rich formatting & links
• Tables

Landscape of IE Tasks (2/4): Pattern Scope

• Web site specific: Formatting (e.g. Amazon.com Book Pages)
• Genre specific: Layout (e.g. Resumes)
• Wide, non-specific: Language (e.g. University Names)

Landscape of IE Tasks (3/4): Pattern Complexity

E.g. word patterns:

• Closed set: U.S. states
– He was born in Alabama…
– The big Wyoming sky…
• Regular set: U.S. phone numbers
– Phone: (413) 545-1323
– The CALD main office can be reached at 412-268-1299
• Complex pattern: U.S. postal addresses
– University of Arkansas, P.O. Box 140, Hope, AR 71802
– Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
• Ambiguous patterns, needing context + many sources of evidence: Person names
– …was among the six houses sold by Hope Feldman that year.
– Pawel Opalinski, Software Engineer at WhizBang Labs.

Landscape of IE Tasks (4/4): Pattern Combinations

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

• Single entity (“named entity” extraction)
– Person: Jack Welch
– Person: Jeffrey Immelt
– Location: Connecticut
• Binary relationship
– Relation: Person-Title; Person: Jack Welch; Title: CEO
– Relation: Company-Location; Company: General Electric; Location: Connecticut
• N-ary record
– Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt

Evaluation of Single Entity Extraction

TRUTH: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.

PRED: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
\[ \text{Precision} = \frac{\#\ \text{correctly predicted segments}}{\#\ \text{predicted segments}} = \frac{2}{6} \]

\[ \text{Recall} = \frac{\#\ \text{correctly predicted segments}}{\#\ \text{true segments}} = \frac{2}{4} \]

\[ F_1 = \text{harmonic mean of Precision and Recall} = \frac{1}{\left( (1/P) + (1/R) \right) / 2} \]

State of the Art Performance

• Named entity recognition
– Person, Location, Organization, …
– F1 in high 80s or low- to mid-90s
• Binary relation extraction
– Contained-in (Location1, Location2); Member-of (Person1, Organization1)
– F1 in 60s or 70s or 80s
• Wrapper induction
– Extremely accurate performance obtainable
– Human effort (~30 min) required on each site

Landscape of IE Techniques (1/1): Models

[Figure: each model illustrated on the sentence “Abraham Lincoln was born in Kentucky.”]
• Lexicons: is the candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
• Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to
• Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes
• Boundary Models: classifiers mark BEGIN and END boundaries
• Finite State Machines: find the most likely state sequence
• Context Free Grammars: parse the sentence (tags NNP, NNP, V, V, P, NP; phrases NP, PP, VP, S)
• …and beyond

Any of these models can be used to capture words, formatting or both.

Three Examples

• (Un)wrappers
– That use path expressions on DOM trees
• Pattern extractors
– That use path expressions on parse trees
• Context-based slot fillers
– That annotate words into an ontology with the help of the context surrounding them

Extraction from Templated Text

• Many web pages are generated automatically from an underlying database.
• Therefore, the HTML structure of pages is fairly specific and regular (semi-structured).
• However, output is intended for human consumption, not machine interpretation.
• An IE system for such generated pages allows the web site to be viewed as a structured database.
• An extractor for a semi-structured web site is sometimes referred to as a wrapper.
• The process of extracting from such pages is sometimes referred to as screen scraping.

Templated Extraction using DOM Trees

• Web extraction may be aided by first parsing web pages into DOM trees.
• Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract.
• May still need regex patterns to identify the proper portion of the final CharacterData node.

Sample DOM Tree Extraction

HTML (Element)
  HEADER
  BODY
    B
      CharacterData: “Age of Spiritual Machines”
    FONT
      CharacterData: “by”
      A
        CharacterData: “Ray Kurzweil”

Title: HTML → BODY → B → CharacterData
Author: HTML → BODY → FONT → A → CharacterData

• Can be “semi-automated”: users show examples and the program remembers the path expressions
• Wrapper maintenance? Cheap labor…
• If there is cooperation from the source, an API can be established, removing the need for wrappers
• Basis for many startups like Junglee, Flipdog, etc.

Extraction from Free Text involves Natural Language Processing

• If extracting from automatically generated web pages, simple regex patterns usually work.
• If extracting from more natural, unstructured, human-written text, some NLP may help.
– Part-of-speech (POS) tagging
  • Mark each word as a noun, verb, preposition, etc.
– Syntactic parsing
  • Identify phrases: NP, VP, PP
– Semantic word categories (e.g. from WordNet)
  • KILL: kill, murder, assassinate, strangle, suffocate
• Off-the-shelf software is available to do this!
– The “Brill” tagger
• Extraction patterns can use POS or phrase tags.
– Analogous to regex patterns on DOM trees for structured text

I. Generate-and-Test Architecture

Template-driven extraction (where the template is in terms of the syntax tree)

Generic extraction patterns (Hearst ’92):
• “…Cities such as Boston, Los Angeles, and Seattle…” (“C such as NP1, NP2, and NP3”) => IS-A(each(head(NP)), C), …
• “Detailed information for several countries such as maps, …” (constraint: ProperNoun(head(NP)))
• “I listen to pretty much all music but prefer country such as Garth Brooks”

Assessing the fact accuracy

Recall “water flows upwards”.

Assess candidate extractions using Mutual Information (PMI-IR) (Turney ’01):

\[ \mathrm{PMI}(\text{Seattle}, \text{City}) = \frac{|\mathrm{Hits}(\text{Seattle} + \text{City})|}{|\mathrm{Hits}(\text{Seattle})|} = \frac{24.7\text{M}}{107\text{M}} \approx 23\% \]

\[ \mathrm{PMI}(\text{Seattle}, \text{Tomato}) = \frac{1.5\text{M}}{107\text{M}} \approx 1\% \]

Seattle is on the order of 20 times more likely to be a city than a tomato!

..but many things indicate “city”ness

Discriminator phrases f_i:
• “x is a city”
• “x has a population of”
• “x is the capital of y”
• “x’s baseball team…”

\[ \mathrm{PMI}(I, D) = \frac{|\mathrm{Hits}(I + D)|}{|\mathrm{Hits}(I)|} \]

• PMI = frequency of I & D co-occurrence
• 5-50 discriminators D_i
• Each PMI for D_i is a feature f_i
• Naïve Bayes evidence combination, where \phi is the hypothesis that the extracted fact is correct:

\[ P(\phi \mid f_1, f_2, \ldots, f_n) = \frac{P(\phi) \prod_i P(f_i \mid \phi)}{P(\phi) \prod_i P(f_i \mid \phi) + P(\neg\phi) \prod_i P(f_i \mid \neg\phi)} \]

• Keep the probabilities with the extracted facts
• PMI is used for feature selection. NBC is used for learning. Hits are used for assessing PMI as well as conditional probabilities.

Some Sources of Ambiguity

• Time: “Clinton is the president” (in 1996).
• Context: “common misconceptions..”
• Opinion: Elvis…
• Multiple word senses: Amazon, Chicago, Chevy Chase, etc.
– Dominant senses can mask recessive ones!
– Approach: unmasking.
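As a rough illustration of the PMI assessment and Naïve Bayes evidence combination described above, here is a minimal Python sketch. The hit counts are the ones quoted on the slides; the function names, the prior, and the per-discriminator likelihoods are hypothetical values chosen only for illustration, not parameters from the original system.

```python
from math import prod


def pmi(hits_instance_with_discriminator, hits_instance):
    """PMI-IR style score from the slides: |Hits(I + D)| / |Hits(I)|."""
    if hits_instance <= 0:
        raise ValueError("hits_instance must be positive")
    return hits_instance_with_discriminator / hits_instance


def naive_bayes_posterior(prior, likelihoods_if_true, likelihoods_if_false):
    """Combine per-discriminator evidence into P(fact is correct | f_1..f_n)."""
    true_mass = prior * prod(likelihoods_if_true)
    false_mass = (1.0 - prior) * prod(likelihoods_if_false)
    return true_mass / (true_mass + false_mass)


# Hit counts quoted on the slides, in millions.
HITS_SEATTLE = 107.0
HITS_SEATTLE_AND_CITY = 24.7
HITS_SEATTLE_AND_TOMATO = 1.5

pmi_city = pmi(HITS_SEATTLE_AND_CITY, HITS_SEATTLE)
pmi_tomato = pmi(HITS_SEATTLE_AND_TOMATO, HITS_SEATTLE)

# Hypothetical prior and likelihoods for two discriminators (illustration only).
posterior = naive_bayes_posterior(
    prior=0.5,
    likelihoods_if_true=[0.9, 0.8],
    likelihoods_if_false=[0.2, 0.3],
)

print(f"PMI(Seattle, City)   = {pmi_city:.3f}")
print(f"PMI(Seattle, Tomato) = {pmi_tomato:.3f}")
print(f"P(correct | f_1, f_2) = {posterior:.3f}")
```

In a real system the hit counts would come from search-engine queries and the likelihoods would be estimated from labeled seed facts; both are hard-coded here to keep the sketch self-contained.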
‘Chicago –City’

Chicago: City vs. Movie

\[ \mathrm{PMI}(I, D, C) = \frac{|\mathrm{Hits}(I + D \mid C)|}{|\mathrm{Hits}(I \mid C)|} \]

Chicago Unmasked

City sense vs. Movie sense

\[ \frac{|\mathrm{Hits}(\text{Chicago} + \text{Movie} - \text{City})|}{|\mathrm{Hits}(\text{Chicago} - \text{City})|} \]

Impact of Unmasking on PMI

Name          Recessive   Original   Unmask   Boost
Washington    city        0.50       0.99     96%
Casablanca    city        0.41       0.93     127%
Chevy Chase   actor       0.09       0.58     512%
Chicago       movie       0.02       0.21     972%

Annotate base facts, given text and ontology

Annotation

“The Chicago Bulls announced yesterday that Michael Jordan will...”

The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan,_Michael">Michael Jordan</resource> will...

Semantic Annotation

Named Entity Identification

The simplest task of metadata extraction on natural language (NL) is to establish a “type” relation between entities in the NL resources and concepts in ontologies.

Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt

Annotation

Current practice of annotation for knowledge identification and extraction:
• is time consuming
• needs annotation by experts
• is complex

Goal: reduce the burden of text annotation for Knowledge Management

www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt

SemTag & Seeker

• WWW-03 Best Paper Prize
• Seeded with TAP ontology (72k concepts)
• And ~700 human judgments
• Crawled 264 million web pages
• Extracted 434 million semantic tags
• Automatically disambiguated

SemTag

• Uses broad, shallow knowledge base
• TAP: lexical and taxonomic information about popular objects
– Music
– Movies
– Sports
– Etc.
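The annotation format shown in the Chicago Bulls example above can be sketched with a small label lookup. This is only an illustration under stated assumptions: `TAP_LEXICON` and `annotate` are hypothetical names, the lexicon holds just the two entries from the slide, and no disambiguation is attempted (every exact label match is tagged).

```python
import re

# Hypothetical two-entry lexicon; the URIs are the ones shown on the slide.
TAP_LEXICON = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan,_Michael",
}


def annotate(text, lexicon):
    """Wrap every exact label match in a <resource ref="..."> element."""
    if not lexicon:
        return text
    # Try longer labels first so a short label cannot pre-empt a longer one.
    labels = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b"
    )

    def wrap(match):
        label = match.group(1)
        return f'<resource ref="{lexicon[label]}">{label}</resource>'

    return pattern.sub(wrap, text)


sentence = "The Chicago Bulls announced yesterday that Michael Jordan will..."
annotated = annotate(sentence, TAP_LEXICON)
print(annotated)
```

A plain lookup like this cannot tell the basketball player from any other Michael Jordan, which is the gap that automatic disambiguation is meant to close.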
SemTag

• Problem:
– No write access to the original document, so how do you annotate?
• Solution:
– Store annotations in a web-available database

SemTag

• Semantic Label Bureau
– Separate store of semantic annotation information
– HTTP server that can be queried for annotation information
– Examples:
  • Find all semantic tags for a given document
  • Find all semantic tags for a particular object

SemTag

• Methodology: three phases
1. Spotting Pass:
– Tokenize the document
– All instances plus 20 word window
2. Learning Pass:
– Find corpus-wide distribution of terms at each internal node of the taxonomy
– Based on a representative sample
3. Tagging Pass:
– Scan windows to disambiguate each reference
– Finally determined to be a TAP object

SemTag

• Solution:
– Taxonomy Based Disambiguation (TBD)
• TBD expectation:
– Human-tuned parameters used in small, critical sections
– Automated approaches deal with the bulk of the information

SemTag

• TBD methodology:
– Each node in the taxonomy is associated with a set of labels
  • Cats, Football, Cars all contain “jaguar”
– Each label in the text is stored with a window of 20 words: the context
– Each node has an associated similarity function mapping a context to a similarity
  • Higher similarity: more likely to contain a reference

SemTag

Is a context c appropriate for a node v?

• References inside the taxonomy vs. references outside the taxonomy
• Multiple nodes: b = r; b != p(v)

Summary

• Information extraction can be motivated either as explicating more structure from the data or as an automated route to the Semantic Web
• Extraction complexity depends on whether the text you have is “templated” or “free-form”
– Extraction from templated text can be done by regular expressions
– Extraction from free-form text requires NLP
  • Can be done in terms of part-of-speech tagging
• “Annotation” involves connecting terms in a free-form text to items in the background knowledge
– It too can be automated
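As a small illustration of the summary's point that regular patterns can be extracted with regular expressions, here is a minimal Python sketch for the U.S. phone number examples quoted earlier in the slides. The pattern and the function name `extract_phone_numbers` are assumptions made for this sketch; it covers only the two formats shown, "(413) 545-1323" and "412-268-1299".

```python
import re

# Matches "(413) 545-1323" or "412-268-1299"; other formats are not handled.
PHONE_RE = re.compile(
    r"(?<!\d)"                      # do not start in the middle of a longer number
    r"(?:\((\d{3})\)\s*|(\d{3})-)"  # area code: "(413) " or "412-"
    r"(\d{3})-(\d{4})"              # exchange and line number
    r"(?!\d)"                       # do not stop in the middle of a longer number
)


def extract_phone_numbers(text):
    """Return every phone number found in text, normalized to NNN-NNN-NNNN."""
    numbers = []
    for match in PHONE_RE.finditer(text):
        area_code = match.group(1) or match.group(2)
        numbers.append(f"{area_code}-{match.group(3)}-{match.group(4)}")
    return numbers


examples = [
    "Phone: (413) 545-1323",
    "The CALD main office can be reached at 412-268-1299",
    "He was born in Alabama",
]
results = [extract_phone_numbers(example) for example in examples]
for example, found in zip(examples, results):
    print(example, "->", found)
```

The same approach breaks down for the complex and ambiguous patterns in the landscape slides (postal addresses, person names), where context and multiple sources of evidence are needed.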