Core Technologies for IE
Günter Neumann & Feiyu Xu
{neumann, feiyu}@dfki.de
Language Technology Lab, DFKI, Saarbrücken
Course: Intelligent Information Extraction, Neumann & Xu, ESSLLI Summer School 2004

Outline
• Overview
• Shallow technologies
• More NL understanding
• Hybrid approaches
• Additional topics

Types of IE Tasks
Extraction of ...
• Topics
• Terms
• Named entities
• Simple relations
• Complex relations (template filling)
• Answers to ad-hoc questions (QA systems)
• Some tasks require the merging of information from different parts of a document (e.g. template merging) and/or the fusion of information from several documents (information fusion).

IE Core Technologies
A wide range of technologies for information extraction has emerged, ranging from non-linguistic approaches, e.g. methods that exploit formatting and other formal properties of semi-structured documents, all the way to truly linguistic approaches.

Unstructured data in the sense of computer science usually exhibits a complex linguistic structure; NLP methods are employed to detect and exploit this structure. Depending on the amount of structural analysis, NLP methods may be classified as shallow or deep: whereas deep methods try to analyze most or all of the linguistic structure, shallow methods derive much less structure.

Discrete methods are based on symbolic rules, patterns and principles such as rewriting rules or regular expressions. Non-discrete methods are based on statistical/probabilistic techniques or neural networks.
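The discrete/non-discrete contrast can be made concrete with a toy classifier: a symbolic rule either fires or it does not, while a non-discrete rule combines weighted cues into a score and thresholds it. The regex, cue names and weights below are invented for illustration; this is a sketch, not part of the course material.

```python
import re

# Discrete (symbolic) rule: a regular expression that fires or not.
COMPANY_RE = re.compile(r"\b[A-Z][a-z]+ (?:Co|Corp|GmbH)\b\.?")

def discrete_is_company(text):
    return COMPANY_RE.search(text) is not None

# Non-discrete (statistical) rule: several weighted cues are combined
# into a score; a threshold makes the classification decision.
CUES = {"capitalized": 1.2, "has_legal_suffix": 2.5, "after_preposition": -0.4}

def soft_is_company(features, threshold=1.5):
    score = sum(w for cue, w in CUES.items() if features.get(cue))
    return score >= threshold

print(discrete_is_company("a joint venture with Bridgestone Corp. in Taiwan"))  # True
print(soft_is_company({"capitalized": True, "has_legal_suffix": True}))         # True
```

The point of the contrast: the discrete rule is easy to write and inspect by hand, while the non-discrete rule can model soft regularities and be tuned by machine learning.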
Choosing a Method
The choice of method depends on a number of criteria:
• Task: what should be extracted? (topics, names, terms, relations, template fillers, answers)
• What are the performance criteria of the application? (recall, precision, efficiency)
• What are the design properties of the system? (maintainability, adaptivity, interoperability, scalability, etc.)

Deep methods are accurate but lack efficiency and often also coverage (recall). Shallow methods are generally more efficient than deep ones and usually also support wider coverage and therefore better recall. Many discrete methods are well suited for the manual design of rule or pattern sets. Non-discrete methods offer the advantage of modelling soft regularities, especially the weighted contributions of several factors to classification decisions. For many non-discrete methods, very effective machine-learning schemes exist, supporting adaptivity and scalability.
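The performance criteria mentioned above, recall and precision, together with the derived F-measure, are computed from the extracted items versus a gold standard. A minimal sketch (the example sets are invented):

```python
# Precision, recall and F-measure from extracted vs. gold annotations.
def prf(extracted, gold):
    tp = len(extracted & gold)                       # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f

p, r, f = prf({"Siemens", "GmbH", "IBM"}, {"Siemens", "IBM"})
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 1.0 0.8
```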
A Classification of Methods
• shallow / discrete (symbolic): FSTs, chunk parsers
• shallow / non-discrete (statistical): weighted FSTs, SVMs, HMMs
• deep / discrete (symbolic): CFG parsers, HPSG parsers
• deep / non-discrete (statistical): PCFG parsers, HPSG parsers with statistical parse selection
Hybrid approaches combine cells of this matrix, e.g. FSTs and SVMs with an HPSG parser.

Shallow NLP
• Coarse-grained linguistic analysis tailored to a specific domain
• Less structure
• Direct mapping from linguistic to domain structure
• Widely used approach
• Bottom-up chunk recognition (named entities, NPs, verb groups)
• Template filling on the basis of domain-specific verb frames or other patterns
Example: "Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a Japanese trading house to produce golf clubs to be supplied to Japan."
Recognized as a joint-venture event by applying the template pattern:
[COMPANY] [SET-UP] [JOINT-VENTURE] (others)* with [COMPANY]

Shallow Techniques
• Use of finite-state transducers which map from surface form to IE-relevant concepts and relations
• Systems and platforms:
  • SRI technologies: FASTUS
  • Sheffield: GATE
  • DFKI IE core technologies: IE from real-world German free texts, SProUT

Why Finite State Technology?
• most of the relevant local phenomena can be easily expressed as finite-state devices
• time and space efficiency
• existence of efficient determinization, minimization and optimization transformations
• there exist much more powerful formalisms, like context-free grammars or unification grammars, but application developers prefer more pragmatic solutions

DFKI Finite State Tools: Overview
• efficiency-oriented implementation
• architecture and functionality mainly based on the tools developed by AT&T
• representation: textual format vs. compressed binary format, e.g.
  0 1.0
  -----------------------
  0 1 a b 1.0
  0 2 a c 2.0
  -----------------------
  1 3.0
  2 4.0
• most of the provided operations are based on recent approaches (Mohri, Pereira, Roche, Schabes)

DFKI Finite State Tools: Operations
Operations are divided into four main pools:
• converting operations:
  - converting the textual representation into binary format and vice versa
  - creating graph representations of FSMs, etc.
• rational and combination operations:
  - union, concatenation, closure, local extension
  - reversion, intersection, inversion, composition
• equivalence transformations:
  - determinization
  - epsilon removal
  - a bunch of minimization algorithms
  - trimming
• other:
  - extending the input/output alphabet
  - collecting arcs with identical labels
  - determinicity test
  - information about an FSM

DFKI Finite State Tools: Examples
• Keyword recognition: automata search
• Keyword recognition: deterministic search
• Keyword recognition: minimal deterministic search
• contextual part-of-speech filtering rule: if the previous word form is a determiner and the next word form is a noun, then filter out the verb reading
  example: ... die bekannten Bilder ...
  ("the known pictures")

Traditional IE Architecture
• Tokenization: text sectionizing and filtering
• Morphological and lexical processing: part-of-speech tagging, word sense tagging
• Local text analysis / parsing: fragment processing, fragment combination
• Discourse analysis: scenario pattern matching, coreference resolution, inference, template merging

FASTUS
• Acronym for Finite State Automaton Text Understanding System
• Developed at SRI International, Menlo Park, California
• Participated in MUC-4 (92), MUC-5 (93), MUC-6 (95), MUC-7 (97)
• English and Japanese
• Inspired by
  • the performance in MUC-3 that the group at the University of Massachusetts got out of a simple system [Lehnert et al., 1991]
  • Pereira's work on finite-state approximations of grammars [Pereira, 1990]
• Works as a set of cascaded, nondeterministic finite-state automata

FASTUS Architecture
Text → Complex Words (tokenization) → Recognizing Phrases → Basic Phrases → Complex Phrases → Domain Event Patterns (parsing) → Merging Structures (discourse analysis)
• complex words: "joint venture", "set up", "Bridgestone Sports Taiwan Co."
• basic phrases: noun groups ("a local concern"), verb groups ("had set up")
• complex phrases: appositive attachment ("Peter, the director"), pp attachment ("production of iron and wood")

Example of Phrase Recognition
"Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime"
Noun Group: Salvadoran President-elect
Name: Alfredo Cristiani
Verb Group: condemned
Noun Group: the terrorist killing
Preposition: of
Noun Group: Attorney General
Name: Roberto Garcia Alvarado
Conjunction: and
Verb Group: accused
Noun Group: the Farabundo Marti National Liberation Front (FMLN)
Preposition: of
Noun Group: the crime

Example of Domain Event Pattern Recognition
"[Salvadoran President-elect Alfredo Cristiani]2 condemned [the terrorist killing of Attorney General Roberto Garcia Alvarado]1 and [accused the Farabundo Marti National Liberation Front (FMLN) of the crime]2"
Two patterns are recognized:
1. <Perpetrator> killing of <HumanTarget>
2. <GovtOfficial> accused <PerpOrg> of <Incident>
Two incident structures are constructed:
• Incident: KILLING; Perpetrator: "terrorist"; Confidence: __; HumanTarget: "Roberto Garcia Alvarado"
• Incident: KILLING; Perpetrator: FMLN; Confidence: Accused by Authorities; HumanTarget: __

"Pseudo-Syntax"
The material between the end of the subject noun group and the beginning of the main verb group must be read over, for example:
• Read over prepositional phrases and relative clauses:
  1. Subject {Preposition NounGroup}* VerbGroup
  2. Subject Relpro {NounGroup | Other}* VerbGroup
  "The mayor, who was kidnapped yesterday, was found dead today."
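Pattern 1 above can be emulated over a sequence of chunks by reading over "Preposition NounGroup" units between the subject and the verb group. A toy sketch; the chunk tags NG/P/VG and the helper name are my own abbreviations, not FASTUS code:

```python
# Chunks as (tag, text) pairs. The pseudo-syntax pattern
# "Subject {Preposition NounGroup}* VerbGroup" is matched by reading
# over intervening prepositional material between subject and verb.
def match_subject_verb(chunks):
    tags = [tag for tag, _ in chunks]
    if not tags or tags[0] != "NG":
        return None
    i = 1
    while i + 1 < len(tags) and tags[i] == "P" and tags[i + 1] == "NG":
        i += 2  # read over one "Preposition NounGroup" unit
    if i < len(tags) and tags[i] == "VG":
        return chunks[0][1], chunks[i][1]  # (subject, verb group)
    return None

chunks = [("NG", "the mayor"), ("P", "of"), ("NG", "the city"),
          ("VG", "was found")]
print(match_subject_verb(chunks))  # ('the mayor', 'was found')
```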
• Conjoined verb phrases: skip over the first conjunct and associate the subject with the verb group in the second conjunct:
  Subject VerbGroup {NounGroup | Other}* Conjunction VerbGroup

Problem of "Pseudo-Syntax"
The same semantic content can be realized in different forms:
• GM manufactures cars.
• Cars are manufactured by GM.
• GM, which manufactures cars, ...
• cars, which are manufactured by GM, ...
• cars manufactured by GM
• GM is to manufacture cars.
• Cars are to be manufactured by GM.
• GM is a car manufacturer.
Question: How many rules are needed to extract all relevant patterns? Why not use a linguistic theory?
• Performance vs. competence:
  • TACITUS took hours to process the MUC messages
  • FASTUS took only minutes to process the same messages

Example of Template Merging
"Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime"
The two incident structures
  Incident: KILLING; Perpetrator: "terrorist"; Confidence: __; HumanTarget: "Roberto Garcia Alvarado"
  Incident: KILLING; Perpetrator: FMLN; Confidence: Accused by Authorities; HumanTarget: __
are merged into
  Incident: KILLING; Perpetrator: FMLN; Confidence: Accused by Authorities; HumanTarget: "Roberto Garcia Alvarado"

GATE — a General Architecture for Text Engineering
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Some free components ... and wrappers for other people's components
• Tools for: evaluation; visualisation/editing; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/

DFKI IE Core Technologies

SMES: Information Extraction from Real-World German Text

Domain-Adaptive IE Architecture
Main goals:
• systematic treatment of domain-independent and domain-specific knowledge
• robust and fast shallow text processor (mildly deep)
• abstract level of linking between linguistic and domain knowledge
Text → IE core system: the shallow text processor produces UFDs, which template generation maps to instantiated templates using domain knowledge (ontology, lexicon).

Underspecified (Partial) Functional Descriptions (UFDs)
UFD: flat dependency-based structure; only upper bounds for attachment and scoping
Example:
[PN Die Siemens GmbH] [V hat] [year 1988] [NP einen Gewinn] [PP von 150 Millionen DM], [Comp weil] [NP die Auftraege] [PP im Vergleich] [PP zum Vorjahr] [Card um 13%] [V gestiegen sind].
("The Siemens company made a profit of 150 million marks in 1988, since orders increased by 13% compared to the previous year.")
Resulting UFD:
  hat: Subj = Siemens, Obj = Gewinn, PPs = {1988, von(150M)}, Comp = weil
  weil: SC = steigen
  steigen: Subj = Auftrag, PPs = {im(Vergleich), zum(Vorjahr), um(13%)}

Major Shallow Components
• C++ toolkit for weighted finite-state transducers (WFSTs)
  • e.g., determinization, minimization, concatenation, local extension (following advanced approaches developed at AT&T and Xerox)
  • compact and efficient internal representations
• Dynamic tries for lexical and morphological processing
  • recursive traversal (e.g., for compound and derivation analysis)
  • robust retrieval (e.g., shortest/longest suffix/prefix)
• Parameterizable XML output interface
• Both tools are portable across different platforms (Unix, Linux, Windows NT)

Domain-Independent Shallow Text Processing Components
Example: "rund ... bis ... Prozent der Steigerungsrate" ("about ... to ... percent of the growth rate")
Pipeline (finite-state technology throughout, over a shared text chart and linguistic knowledge pool):
• Tokenizer: token classes and word forms ("rund" → lowercase word)
• Lexical Processor: "Steigerungsrate" → steigerung[s]rate; online compounds; hyphen coordination
• POS Filtering: "bis": prep|adv → prep; contextual rules, also Brill's TBL (Schabes/Roche approach)
• Named Entity Finder: "rund ... bis ... Prozent" → percentage NP; subgrammars; dynamic lexicon; reference resolution
• Phrase Recognizer: "rund ... bis ... Prozent" → NP, "der Steigerungsrate" → NP; generic NPs, PPs, VGs
• Clause Recognizer: divide-and-conquer chunk parser
• XML output interface (parametrizable)

Tokenizer
The goal of the tokenizer is to:
• map sequences of consecutive characters into word-like units (tokens)
• identify the type of each token, disregarding the context
• perform word segmentation when necessary (e.g., splitting contractions into multiple tokens)
• overall more than 50 classes (proved to
simplify processing on higher stages), e.g.:
  - NUMBER_WORD_COMPOUND
  - ABBREVIATION (CANDIDATE_FOR_ABBREVIATION)
  - COMPLEX_COMPOUND_FIRST_CAPITAL ("AT&T-Chief")
  - COMPLEX_COMPOUND_FIRST_LOWER_DASH ("d'Italia-Chefs")
• represented as a single WFSA (406 KB)

Lexical Processor
Tasks of the lexical processor:
• retrieval of lexical information
• recognition of compounds ("Autoradiozubehör" — car radio equipment)
• hyphen coordination ("Leder-, Glas-, Holz- und Kunststoffbranche" — leather, glass, wood and synthetic materials industry)
• the lexicon currently contains more than 700,000 German full-form words (tries)
• each reading is represented as a triple <STEM, INFLECTION, POS>
Example: "wagen" (to dare vs. a car)
  STEM: "wag", POS: noun, INFL: (GENDER: m, CASE: nom|akk|dat, NUMBER: sg), (GENDER: m, CASE: nom|akk|dat|gen, NUMBER: pl)
  STEM: "wag", POS: verb, INFL: (FORM: infin), (TENSE: pres, PERSON: anrede, NUMBER: sg|pl), (TENSE: pres, PERSON: 1|3, NUMBER: pl), (TENSE: subjunct-1, PERSON: anrede, NUMBER: sg|pl), (TENSE: subjunct-1, PERSON: 1|3, NUMBER: pl), (FORM: imp, PERSON: anrede)

Part-of-Speech Filtering
• The task of the POS filter is to filter out implausible readings of ambiguous word forms
• a large amount of German word forms are ambiguous (20% in a test corpus)
• contextual filtering rules (ca.
100), e.g.:
  "Sie bekannten, die bekannten Bilder gestohlen zu haben"
  ("They confessed to having stolen the famous pictures")
  "bekannten": to confess vs. famous
  FILTERING RULE: if the previous word form is a determiner and the next word form is a noun, then filter out the verb reading of the current word form
• supplementary rules determined by Brill's tagger in order to achieve broader coverage
• rules represented as FSTs; hard-coded rules (filtering out rare readings)

Named Entity Finder
• The task of the named entity finder is the identification of:
  - entities: organizations, persons, locations
  - temporal expressions: time, date
  - quantities: monetary values, percentages, numbers
• Identification in two steps:
  - recognition patterns expressed as WFSAs are used to identify phrases containing potential candidates for named entities
  - additional constraints (depending on the type of a candidate) are used for validating the candidates, and an appropriate extraction rule is applied in order to recover the named entity
  example: "von knapp neun Milliarden auf über 43 Milliarden Spanische Pesetas" (from almost nine billion to more than 43 billion Spanish pesetas)
  TYPE: monetary, SUBTYPE: monetary-prepositional-phrase
• Longest-match strategy

Named Entity Finder (cont.)
• Arcs of the WFSAs are predicates on lexical items:
  (a) STRING(s) holds if the surface string mapped by the current lexical item is of the form s
  (b) STEM(s) holds if the current lexical item has a preferred reading with stem s, or the current lexical item has no preferred reading but at least one reading with stem s
  (c) TOKEN(x) holds if the token type of the surface string mapped by the current lexical item is x
• Example: a simple automaton for the recognition of company names
  additional constraint: disallow the determiner reading for the first word
  candidate: "Die Braun GmbH & Co."
  extracted: "Braun GmbH & Co."

Named Entity Finder (cont.)
• Additional lexica for geographical names, first names (persons) and company names, compiled as WFSAs (new token classes)
• Named entities may appear without designators (companies, persons)
• Dynamic lexicon for storing named entities without designators
• Candidates for named entities, example:
  "Da flüchten sich die einen ins Ausland, wie etwa der Münchner Strickwarenhersteller März GmbH oder der badische Strumpffabrikant Arlington Socks GmbH. Ab kommendem Jahr strickt März knapp drei Viertel seiner Produktion in Ungarn."
  (Some flee abroad, such as the Munich knitwear maker März GmbH or the Baden hosiery manufacturer Arlington Socks GmbH. From next year, März will knit almost three quarters of its production in Hungary.)
• Resolution of type ambiguity using the dynamic lexicon: if an expression can be a person name or a company name ("Martin Marietta Corp."), then use the type of the last entry inserted into the dynamic lexicon to make the decision

Performance of Shallow Text Processing for German
• Basis: corpus of the German business magazine "Wirtschaftswoche" (1.2 MB, 197,118 tokens)
• Performance: ~32 sec. (~6,160 words/sec; Pentium III, 500 MHz, 128 MB RAM)
• Evaluation (20,000 tokens), recall / precision:
  • compound analysis: 98.53% / 99.29%
  • part-of-speech filtering: 74.50% / 96.36%
  • named entities (including NE reference resolution): 85% / 95.77%
    • person names: 81.27% / 95.92%
    • companies: 67.34% / 96.69%
    • locations: 75.11% / 88.20%
    • total: 73.94% / 94.10%
  • fragments (NPs, PPs): 76.11% / 91.94%

Chunk Parsing Strategies
For the improvement of robustness and coverage on the sentence level:
Text (morphologically analysed) → phrase recognition → stream of phrases → clause recognition → grammatical fct.
recognition → stream of sentences
Current chunk parser:
• bottom-up: first phrases, then sentence structure
• main problem: even the recognition of simple sentence structure depends on the performance of phrase recognition
• example: complex NPs (nominalization style), relative pronouns:
  "[Die vom Bundesgerichtshof und den Wettbewerbern als Verstoss gegen das Kartellverbot gegeisselte zentrale TV-Vermarktung] ist gängige Praxis."
  ("[Central TV marketing, censured by the Federal High Court and the competitors as a violation of the cartel ban,] is common practice.")

A New Chunk Parser: Divide-and-Conquer Strategy
• first compute the topological structure of the sentence
• second, apply phrase recognition to the fields
Example (field recognizer):
  "Diese Angaben konnte der Bundesgrenzschutz aber nicht bestätigen. Kinkel sprach von Horrorzahlen, denen er keinen Glauben schenke."
  (This information couldn't be verified by the Border Police. Kinkel spoke of horrible figures that he didn't believe.)
Evaluation: 400 sentences (6,306 words)
  Verb groups: 98.59% F
  Clause structure: 91.62% F
  All components: 87.14% F
More details in the Master's thesis of C. Braun and in Neumann, Braun & Piskorski (ANLP).

The Divide-and-Conquer Parser: a Series of Finite-State Grammars
Stream of morph-syn words → named entities → topological structure → verb groups → base clauses → clause combination → main clauses
Example: "Weil die Siemens GmbH, die vom Export lebt, Verluste erlitt, mußte sie Aktien verkaufen."
("Because the Siemens Corp., which strongly depends on exports, suffered losses, it had to sell shares.")
• verb groups: Weil die Siemens GmbH, die vom Export Verb-FIN, Verluste Verb-FIN, Modv-FIN sie Aktien FV-Inf.
• base clauses: Weil die Siemens GmbH, Rel-Clause, Verluste Verb-FIN, Modv-FIN sie Aktien FV-Inf.
• main clauses: Subconj-Clause, Modv-FIN sie Aktien FV-Inf.
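The first phase of the divide-and-conquer strategy, segmenting the sentence into topological fields before any phrase recognition runs, can be caricatured with a token-level splitter. The toy rules below (split at subordinating conjunctions and commas) are far cruder than the finite-state grammars of the actual parser:

```python
# Phase 1 (toy version): topological segmentation — open a new field at
# a subordinating conjunction, close the current field at a comma.
SUBCONJ = {"weil", "dass", "obwohl"}

def topological_fields(tokens):
    fields, current = [], []
    for tok in tokens:
        if tok.lower() in SUBCONJ and current:
            fields.append(current)
            current = []
        current.append(tok)
        if tok == ",":  # clause boundary marker in this toy grammar
            fields.append(current)
            current = []
    if current:
        fields.append(current)
    return fields

# Phase 2 (not shown): phrase recognition runs separately inside each field.
sent = "Weil die Siemens GmbH Verluste erlitt , mußte sie Aktien verkaufen .".split()
for field in topological_fields(sent):
    print(field)  # the weil-clause, then the main clause
```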
Phrase recognition then yields underspecified dependency trees for each clause.

Advanced Multilingual Information Extraction with SProUT
(SProUT — Shallow Processing with Unification and Typed feature structures)

Motivation for SProUT
• one platform for the development of multilingual shallow text processing systems
  • well-documented programming API
  • flexible interface to different processing modules
  • development/testing environment for resources
• TFSs as abstract interchange format
  • XML-based representation
• find a good trade-off between efficiency and expressiveness
  • word-based finite-state devices
  • typed unification-based grammars

SProUT: the XTDL Formalism
• combines typed feature structures (TFSs) and regular expressions, including coreferences and functional application
• XTDL grammar rules: production part on the LHS, output description on the RHS
• a couple of standard regular operators: concatenation, optionality (?), disjunction (|), Kleene star (*), Kleene plus (+), n-fold repetition ({n}), m-to-n span repetition ({m,n}), negation (~)

SProUT Architecture
An XTDL grammar is translated by the regular compiler into finite-state machines; at runtime the XTDL interpreter applies them, via JTFS and the finite-state machine toolkit, to the stream of text items produced by lexical processing over the linguistic resources, yielding structured output data. The grammar development environment is separated from online processing.

XTDL example (PP rule):
  pp :> morph [POS Prep, SURFACE prep, INFL [CASE c]]
        morph [POS Det, INFL [CASE c, NUMBER n, GENDER g]] ?
        morph [POS Adjective, INFL [CASE c, NUMBER n, GENDER g]] *
        morph [POS Noun, SURFACE noun_1, INFL [CASE c, NUMBER n, GENDER g]]
        morph [POS Noun, SURFACE noun_2, INFL [CASE c, NUMBER n, GENDER g]] ?
     -> phrase [CAT pp, PREP prep, AGR agr [CASE c, NUMBER n, GENDER g], CORE_NP core_np],
        where core_np = Append(noun_1, " ", noun_2)

SProUT:
XTDL Grammar Processing
• three-step processing:
  1. matching of regular patterns using unifiability (LHS)
  2. rule instance creation
  3. unification of the rule instance and the matched input stream
• longest-match strategy
• ambiguities allowed
• optimization techniques
  • problem: TFSs are treated as symbolic values by the FSM toolkit
  • sorting outgoing transitions from selected states (transition hierarchy under subsumption)
• rule prioritization
• output merging

SProUT Processing Resources: Tokenization
• fine-grained token classification over many classes
• domain- and language-specific token subclassification, e.g.
  SURFACE: "Berlin", TYPE: first_capital_word, SUBTYPE: city_suffix
• token post-segmentation
• supports almost all European character sets
• minor adjustments necessary when porting to a new language (e.g. the word_with_apostrophe class: it's)

SProUT Processing Resources: Morphology
• full-form lexica (MMorph) vs. external morphology components
• online shallow compound recognition
• current coverage: lexica for English, German, French, Dutch, Italian, Spanish, Polish and Czech; extras include compound recognition and morphological disambiguation; Japanese and Chinese are handled via word segmentation and a POS tagger

SProUT Processing Resources: Gazetteer
• allows associating entries with a list of arbitrary attribute-value pairs, e.g.
  Argentyny | GAZ_TYPE: country | CONCEPT: Argentyna | FULL_NAME: Republika Argentyny | CASE: genitive | CAPITAL: Buenos Aires | continent: ...
• reusage of available resources across languages
• multiple readings, e.g. SURFACE: "Washington" with TYPE: gaz_city (CONCEPT: c_washington_city), TYPE: gaz_state (CONCEPT: c_washington_state), and TYPE: gaz_surname

SProUT Processing Resources: Reference Matcher
• finds identity relations between recognized entities and unconsumed text fragments:
  "... CEO Prof. Dr. Martin Marietta mentioned a new takeover ... ... Martin Marietta GmbH plans to ... ... Marietta said ..."
• grammarians define how variants are built:
  person :> first_name last_name title
         -> [FIRST_NAME first_name, LAST_NAME last_name, TITLE title, VARIANT variant]
         where variant = {Conc(title, last_name), last_name}
• parametrizable contextual frame

SProUT IDE / Development Environment
(screenshots of the integrated development environment)

SProUT: Implementation & Availability
• Core components:
  • FSM toolkit (C++, JAVA APIs)
  • JTFS (JAVA)
  • XML-based general-purpose regular compiler (JAVA, C++)
  • linguistic processing components (JAVA)
  • integrated development environment (JAVA)
• Availability:
  • IDE and core components: Windows, Linux
  • JAVA APIs for integration in other frameworks
  • Freeware?

SProUT Applications
• Industrial projects:
  • CCA — Customer Care Automation (Telekom): QA system with dialogue facilities for the telecommunications domain
  • AGRO-Server (Celi SRL): monitoring and mining customer orientation, http://www.celi.it
• Internal projects:
  • Extralink: information system combining information extraction and hyperlinking for the tourism domain
• EU-funded projects:
  • AIRFORCE (AIR FOReCast in Europe): tools for forecasting air traffic in Europe
  • MEMPHIS (Multilingual Content for Flexible Format Internet Premium Services): platform for cross-lingual premium content services for thin clients, http://www.ist-memphis.org
  • DEEP THOUGHT: hybrid system combining shallow and deep methods for knowledge
intensive information extraction, http://www.eurice.de/deepthought

Combining Deep and Shallow NLP for IE

Isn't SNLP Enough?
• Many simple NL tasks only need SNLP
  • e.g., text clustering, named entity recognition, term extraction, binary relation extraction
• Information extraction: a 60% F-measure barrier in the case of scenario template extraction (e.g., management succession), due to complex phenomena:
  • grammatical function / deep case recognition
  • free word order in German
  • multiple sentences
  • reference resolution for template merging
• Content-oriented web services / Semantic Web
  • more precise structural extraction needed
  • e.g., ontology extraction, question answering systems
• Shallow systems are getting deeper, requiring more and more accurate linguistic information
  • integration of deep NL components

Possible Integration Scenarios
• The parallel approach: run SNLP and DNLP in parallel; prefer results from DNLP
  • problem: different processing speeds when processing large quantities of text
  • speed: the slowest component determines the overall speed
  • precision: overall precision hardly distinguishable from a shallow-only system, since DNLP always times out
• The Whiteboard approach: an integrated, flexible architecture where components can play to their strengths
  • call DNLP on results of SNLP: SNLP results identify relevant candidates for on-demand use of DNLP (domain-specific criteria)
  • robustness handled by SNLP to increase the coverage of DNLP
  • use SNLP to guide DNLP towards the most likely syntactic analysis, leading to improved speed

Combining the Best of the Two Worlds for IE
• utilization of SNLP as the primary linguistic resource for robustness, efficiency and domain-specific interpretation of local
relationships
• integration of DNLP, if the analysis exists and is relevant, for extracting relationships embedded in complex linguistic constructions, e.g., free word order, long-distance dependencies, control and raising, or passive
• combination of finite-state technology with a unification-based formalism for the definition of template-filling rules, scenario templates and template elements
• use of unification and subsumption checking as basic operations for template merging

A Simple Example
"After the retirement of Peter Smith, Mary Hopp was persuaded to take over the development sector."
shallow: the pattern "retirement of" + PN fills Person_Out = "Peter Smith" (1)
deep: Pred = "take over", Agent = "Mary Hopp" (2), Theme = "development sector" (3), filling Person_In = 2 and Sector = 3

Head-Driven Phrase Structure Grammar (HPSG)
• A constraint-based, lexicalist approach to grammatical theory
• Models languages as systems of constraints
• Linguistic units and their relations are expressed by typed feature structures
• No transformations for unbounded dependencies
• Multiple inheritance hierarchies for lexicon organization
• More information:
  • http://lingo.stanford.edu/erg.html
  • http://www.coli.uni-sb.de/%7Ehansu/courses/hpsg_arch.html
  • http://www.coli.uni-sb.de/%7Ehansu/psfiles/hpsg.ps

WHIE: A Hybrid IE Approach
• hybrid information extraction architecture built on top of the Whiteboard Annotation Machine (WHAM)
• domain modeling for the hybrid architecture
• semi-automatic construction of template-filling rules
•
heuristics for domain/relevance-driven access to the different levels of WHAM
• evaluation
  – hybrid approach
  – improvement after integration of deep NLP

Integration of Shallow and Deep Analysis
Goal of integrated "hybrid" syntactic processing:
• Robustness and efficiency of shallow analysis
• Precision and fine-grainedness of deep syntactic analysis
Lexical integration
• SPPC-HPSG interface: building HPSG lexicon entries "on the fly"
  – Named entities, open-class categories (nouns, adjectives, adverbs, ...)
• HPSG-GermaNet integration (Siegel, Xu, Neumann 2001)
  – association with HPSG lexical sorts
⇒ coverage and robustness
Phrasal integration for "hybrid" syntactic processing
• Robust, stochastic topological field parsing
• Integration of shallow topological and deep syntactic parsing
⇒ efficiency and robustness

Integration of Shallow and Deep Analysis — Problems and Solutions —
Integration of shallow and deep parsing:
• Guiding deep analysis by shallow (chunk-based) pre-partitioning
• Interleaved (hybrid) shallow/deep processing
The deep-shallow mapping problem:
• Chunk parsing is not isomorphic to deep syntactic structure ("attachments")

Template Extraction System (Xu & Krieger, 2003)
(architecture diagram: chunks and PredArgs from WHAM feed shallow TF (SProUT) with pattern-based TF rules and deep TF with PredArg-based TF rules; sentential and discourse template merging apply template merging rules, using subsumption check and unification over a type hierarchy)

Typed Feature Structures (TFS) for the IE task
• As a formal language for modeling the domain vocabulary and its relations in the type hierarchy:
  – template elements
  – scenario templates
  – linguistic units
  – domain ontologies
  – rules for template filling and template merging
• Dynamic online construction of templates
•
TFS unifier supports well-formed unification of TFSs and subsumption check

Template Filling with Deep and Shallow Analysis (Example)
Interaction of passive, control and multiple named entities:
"Because of the retirement of Peter Müller, Hans Becker was persuaded to take over the development sector."
• Pattern-based TF rule (shallow TF): "retirement of PN[1]" → [PERSON_OUT: [1] "Peter Müller"]
• Deep TF rule: PRED "take over", AGENT[2] "Hans Becker", THEME[3] "development sector" → [PERSON_IN: [2], DIVISION: [3]] (further template slots: ORGANISATION, POSITION)

Semi-Automatic Construction of Template Filling Rules
• learning relevant terms (Xu et al. 2002)
  – a KFIDF term-classification method
  – verb-noun collocations
• adaptation of relevant terms to the hybrid architecture
  – Shallow: adjectives and nouns as triggers for pattern-based template filling rules, e.g.,
    "The previous president of car supplier Kolbenschmidt, Heinrich Binder"
    "Successor of the officeholder Hans Guenter Merk"
  – Deep: verbs as triggers for lexicalized unification-based template filling rules, e.g.,
    "General manager Eugen Krammer (59), …, will resign from his office on May 31, 1997"
⇒ if a domain is dominated by relevant verbs, it is helpful to integrate DNLP
⇒ domains can be characterized by the distribution of terms with respect to POSs

Single-Word Term Extraction
a specific TFIDF measure: KFIDF
a word is relevant if it occurs more frequently than other words in a certain category and rarely in other categories.
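As a minimal sketch, this relevance criterion can be computed directly from a category-indexed document collection. The function below assumes the KFIDF form docs(w, cat) · log(n · |cats| / (cats(w) + 1)); the data layout and helper names are illustrative, not part of the original system.

```python
from math import log2

def kfidf(word, cat, docs_by_cat, n=2):
    """KFIDF relevance of `word` for category `cat`.

    docs_by_cat: dict mapping each category name to a list of
    documents, where each document is a set of words.
    n: smoothing factor.
    Sketch assuming KFIDF(w, cat) = docs(w, cat) *
    log(n * |cats| / (cats(w) + 1)).
    """
    # number of documents in `cat` that contain the word
    docs_w_cat = sum(1 for d in docs_by_cat[cat] if word in d)
    # number of categories in which the word occurs at all
    cats_w = sum(1 for c in docs_by_cat
                 if any(word in d for d in docs_by_cat[c]))
    return docs_w_cat * log2(n * len(docs_by_cat) / (cats_w + 1))
```

A word frequent in one category and absent from the others gets a high score there and zero elsewhere, which is exactly the behavior the criterion above asks for.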
KFIDF(w, cat) = docs(w, cat) × log(n × |cats| / (cats(w) + 1))

docs(w, cat) = number of documents in the category cat containing the word w
n = smoothing factor
|cats| = total number of categories
cats(w) = number of categories in which the word w occurs
A similar measure is also used in [Buitelaar & Sacaleanu].

Term Classification

Experiment
• Input: classified documents
• Domains: stock market, management succession and narcotics

Stock Market
most relevant terms are nouns:
Aktienbörse [stock exchange], Veränderung [change], Gewinner [winner], Verlierer [loser], Hochtief [up and down], Tief [low], Carbon [carbon], Aktie [stock], Kurs [stock price]

Management Succession
most relevant terms are verbs:
berufen [appoint to], wählen [choose], übernehmen [accept], bestellen [nominate], verlassen [leave], wechseln [change], ausscheiden [resign], nachfolgen [succeed], zurücktreten [resign], antreten [assume office]

Learning Relevant Verb-Noun Collocations
verb-noun collocations:
Ruhestand [retirement] – treten [step], Leitung [leadership] – übernehmen [take over]
German free word order
⇒ it is not sufficient to take into account only bigrams and trigrams; the word order between verbs and nouns is not fixed and very often discontinuous
⇒ consider all possible term pairs in a sentence, ignoring the linear order:
  – verb noun
  – adjective noun
  – noun noun
using different association measures (Evert & Krenn):
⇒ mutual information (Church & Hanks, 1989)
⇒ T-test (Daille, 1996)
⇒ log-likelihood (Manning & Schütze, 1999)

Integrating DNLP on Demand
If a sentence contains relevant verbs
and verb-noun collocations in addition to other terms, the sentence should be passed to DNLP; otherwise, SNLP will be sufficient.

Evaluation of Template Extraction
• both at the sentential level
• proof of
  – usefulness of the hybrid architecture
  – how much improvement deep NLP helps to achieve
  – whether domain modeling gives the right prediction

Evaluation at Sentential Level: Corpus
• construction of an evaluation corpus based on the following criterion:
⇒ the more relevant verbs and nouns a sentence contains, the more relevant is the sentence in this domain.

RS = NV · lg(Σ_{i=1}^{NV} RVT_i) + NN · lg(Σ_{j=1}^{NN} RNT_j), where NV > 0 and NN > 0

RS: relevance of a sentence
NV: number of relevant verbs in a sentence
NN: number of relevant nouns in a sentence
RVT_i: the relevance of a verb term i occurring in the sentence
RNT_j: the relevance of a noun term j occurring in the sentence

• selection of the 50 top relevant sentences in the management succession domain

Annotation
(manual) partially filled templates based on shallow and deep template filling rules
construction of an idealized case, which is not affected by the performance of the current system

<sentence id=1>
  Dr. Heinz Neckar ... appointed ... manager of …, as successor of Dr. Herbert Ehrlich, retired.
  <shallow>
    <template name="management_succession">
      <slot name="person_out">Dr. Herbert Ehrlich</slot>
    </template>
  </shallow>
  <deep>
    <template name="management_succession">
      <slot name="person_in">Dr. Heinz Neckar</slot>
    </template>
    <template name="management_succession">
      <slot name="person_out">Dr. Herbert Ehrlich</slot>
    </template>
  </deep>
</sentence>

Evaluation at Sentential Level: Scenarios
• applying sentential template merging to
  – shallow templates only
  – deep templates only
  – fusion of shallow and deep templates

Evaluation Results

                        coverage   recall
Shallow                   0.56      0.31
Deep                      0.94      0.68
union(Shallow, Deep)      0.96      0.92

coverage = the percentage of sentences to which the template filler rules can be applied
precision = 1.0 (manual annotation)
⇒ verbs play a more important role than nouns and adjectives in the management succession domain ⇒ domain modeling
⇒ the information extracted by shallow and deep template filling rules is complementary in most cases ⇒ hybrid architecture

Additional Important Issues
• Build rich and robust semantic representations for IE
• Domain-specific discourse analysis for template merging
• Anaphora resolution
• Ontology inference
• Evaluation
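The subsumption check and unification operations that drive the template merging above can be illustrated with plain dictionaries standing in for typed feature structures. This is a simplified sketch: function names are illustrative, and real TFS unification also checks type compatibility in the type hierarchy, which flat dicts cannot express.

```python
def subsumes(t1, t2):
    """t1 subsumes t2 if every slot filled in t1 has the same filler in t2
    (t2 is at least as specific as t1)."""
    return all(t2.get(slot) == val for slot, val in t1.items())

def unify(t1, t2):
    """Merge two partially filled templates; return None on a slot clash."""
    merged = dict(t1)
    for slot, val in t2.items():
        if slot in merged and merged[slot] != val:
            return None  # conflicting fillers: likely two different events
        merged[slot] = val
    return merged

def merge_templates(templates):
    """Greedy sentential template merging: fold each template into the
    first already-merged template it unifies with; keep clashes apart."""
    result = []
    for t in templates:
        for i, r in enumerate(result):
            m = unify(r, t)
            if m is not None:
                result[i] = m
                break
        else:
            result.append(t)
    return result
```

On the running example, a shallow template {person_out: "Peter Müller"} and a deep template {person_in: "Hans Becker", division: "development sector"} have no conflicting slots, so they unify into one management-succession template, which mirrors the complementarity observed in the evaluation.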