Improving Search through Corpus Profiling
Bonnie Webber, School of Informatics, University of Edinburgh, Scotland

Original motivation
PhD research (Michael Kaisser) on using lexical resources (FrameNet, PropBank, VerbNet) to improve performance in QA. Two methods were developed [Kaisser & Webber, 2007]. Evaluation on the Web and on the AQUAINT corpus produced significantly different results. Other research has likewise found that the same methods on the same input produce significantly different results on different corpora.

FrameNet
Example of annotated FrameNet data. (Screenshots from framenet.icsi.berkeley.edu)

Two QA methods
Method 1: Use the resources to generate templates in which an answer might be found. Project the templates onto quoted strings used directly as search queries.
Method 2: Use the resources to generate dependency structures in which an answer might occur. Search on lexical co-occurrence, then filter results by comparing the structure of candidate sentences with the structure of the annotated resource sentences.

Method 1
Example: “Who purchased YouTube?”
Extract a simplified dependency structure from the question using MiniPar:
  head: purchase.v
  head\subj: “Who”
  head\obj: “YouTube”
Get annotated sentences from FrameNet for purchase.v:
  “[The company]FE:Buyer had [purchased]LU [several PDMS terminals]FE:Goods ...”
Use MiniPar to associate the annotated abstract frame structure with a dependency structure:
  Buyer[Subject, NP] VERB Goods[Object, NP]
Unifying with head=purchase.v, Subject=“Who”, Object=“YouTube” gives:
  Buyer[ANSWER] purchase.v Goods[“YouTube”]
Generate potential answer templates:
  ANSWER[NP] purchased YouTube
  ANSWER[NP] (has|have) purchased YouTube
  ANSWER[NP] had purchased YouTube
  YouTube (has|have) been purchased by ANSWER[NP]
  ...
Use the patterns to generate quoted strings as search queries:
  "YouTube has been purchased by"
Extract sentences from the returned snippets.
Parse the sentences. If the structures match, extract the answer:
  “YouTube has been purchased by Google for $1.65 billion.”

Method 1 (extended)
FrameNet example sentences:
  “[The company]FE:Buyer had [purchased]LU [several PDMS terminals]FE:Goods ...”
  “[the landowner]FE:Seller [sold]LU [the land]FE:Goods to [developers]FE:Buyer ...”
Create additional paraphrases using all verbs in the original frame, plus verbs identified through inter-frame relations:
  ANSWER[NP] bought YouTube
  YouTube was sold to ANSWER[NP]

Method 1 (Web-based evaluation)
Accuracy results on the 264 (of 500) TREC 2002 questions whose head verb is not "be":
  FN base                   0.181
  + all verbs in frame      0.204
  + inter-frame relations   0.215

  FN base    0.181
  PropBank   0.227
  VerbNet    0.223
  combined   0.261

Method 1 (further extension)
FrameNet often gives ‘interesting’ examples rather than common ones. So assume (as a default) that verbs display common patterns:
  Intransitive:  [ARG0] VERB
  Transitive:    [ARG0] VERB [ARG1]
  Ditransitive:  [ARG0] VERB [ARG1] [ARG2]
If one of these patterns is observed in the question but is not among those found in FrameNet, just add it.
  combined    0.261
  combined+   0.284

Method 1
Method 1 and its extensions all lead to clear improvements in QA over the Web, but they may be losing answers by finding only exact string matches:
  “YouTube was recently purchased by Google for $1.65 billion.”
Method 2 addresses this.

Method 2
Associates each annotated sentence in FrameNet and PropBank with a set of dependency paths from the head to each of the frame elements:
  “The Soviet Union[ARG0] has purchased roughly eight million tons of grain[ARG1] this month[TMP]”
  head: “purchase”, path = /i
  ARG0: paths = {./s, ./subj}
  ARG1: paths = {./obj}
  TMP:  paths = {./mod}
Question analysis: same as Method 1. Search is based on key words from the question: purchased YouTube (no quotes). Sentences are extracted from the returned snippets, e.g.:
  “Their aim is to compete with YouTube, which Google recently purchased for more than $1 billion.”
A dependency parse is produced for each extracted sentence.
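Before turning to Method 2's tests, Method 1's template-generation step described above can be sketched in code. This is an illustrative sketch only, not Kaisser & Webber's implementation: the function names and the small hand-listed inflection table are invented for the example, and a real system would derive the patterns from the annotated resource sentences.

```python
# Illustrative sketch of Method 1's template step: given the question's
# known argument filler and an inflection table for the head verb (both
# invented here), instantiate surface patterns for a transitive frame
# and turn them into quoted search-engine queries.

def generate_templates(verb_forms, obj):
    """Instantiate common surface patterns for a transitive frame,
    with ANSWER[NP] marking the slot the answer should fill."""
    past = verb_forms["past"]
    part = verb_forms["participle"]
    return [
        f"ANSWER[NP] {past} {obj}",
        f"ANSWER[NP] (has|have) {part} {obj}",
        f"ANSWER[NP] had {part} {obj}",
        f"{obj} (has|have) been {part} by ANSWER[NP]",
    ]

def to_queries(templates):
    """Drop the ANSWER slot and expand (a|b) alternations, yielding
    quoted strings usable directly as search queries."""
    queries = []
    for t in templates:
        t = t.replace("ANSWER[NP]", "").strip()
        if "(has|have)" in t:
            queries.extend('"%s"' % t.replace("(has|have)", aux)
                           for aux in ("has", "have"))
        else:
            queries.append('"%s"' % t)
    return queries

purchase = {"past": "purchased", "participle": "purchased"}
templates = generate_templates(purchase, "YouTube")
queries = to_queries(templates)
# queries include "purchased YouTube" and "YouTube has been purchased by"
```

In the full system these quoted strings go to a Web search engine, and the sentences extracted from the returned snippets are then parsed and matched against the templates.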
Method 2
Eight tests comparing dependency paths:
  1a Do the candidate and example sentences share the same head verb?
  1b Do the candidate and example sentences share the same path to the head?
  2a In the candidate sentence, do we find one or more of the example’s paths to the answer role?
  2b In the candidate sentence, do we find all of the example’s paths to the answer role?
  3a Can some of the paths for the other roles be found in the candidate sentence?
  3b Can all of the paths for the other roles be found in the candidate sentence?
  4a Do the surface strings of the other roles partially match those of the question?
  4b Do the surface strings of the other roles completely match those of the question?
Each sentence that passes tests 1a and 2a is assigned a weight of 1 (otherwise 0). For each of the remaining tests that succeeds, that weight is multiplied by 2.

Method 2
Annotated frame sentence (from PropBank):
  “The Soviet Union[ARG0] has purchased roughly eight million tons of grain[ARG1] this month[TMP]”
Candidate sentence retrieved from the Web:
  “Their aim is to compete with YouTube, which Google recently purchased for more than $1 billion.”
N.B. The answer appears in an object relative clause, so exact string matching would fail.
Method 2
Candidate sentence:
  head: “purchase”, path = /i/pred/i/mod/pcomp-n/rel/i
  phrase: “Google”, paths = {./s, ./subj}
  phrase: “which”, paths = {./obj}
  phrase: “YouTube”, paths = {\i\rel}
  phrase: “for more than $1 billion”, paths = {./mod}
PropBank example sentence:
  head: “purchase”, path = /i
  ARG0: “The Soviet Union”, paths = {./s, ./subj}
  ARG1: “roughly eight million tons of grain”, paths = {./obj}
  TMP: “this month”, paths = {./mod}
The results of the tests are:
  1a OK   2a OK   3a OK   4a -
  1b -    2b OK   3b OK   4b -
This sentence returns the answer “Google”, to which a score of 8 is assigned.
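The weighting scheme behind that score can be sketched directly. This is an illustration, not the original implementation: the test outcomes are represented as a plain dict of booleans rather than computed from dependency paths.

```python
# Sketch of Method 2's weighting scheme: a candidate earns weight 1 only
# if it passes the gating tests 1a and 2a; each further passing test
# then doubles the weight.

def score(results):
    """results maps test names '1a'..'4b' to True/False."""
    if not (results.get("1a") and results.get("2a")):
        return 0  # fails the gating tests
    weight = 1
    for test in ("1b", "2b", "3a", "3b", "4a", "4b"):
        if results.get(test):
            weight *= 2
    return weight

# The YouTube/Google candidate above passes 1a, 2a, 2b, 3a and 3b:
youtube = {"1a": True, "1b": False, "2a": True, "2b": True,
           "3a": True, "3b": True, "4a": False, "4b": False}
print(score(youtube))  # 1 * 2 * 2 * 2 = 8, the score assigned above
```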
Method 2
Candidate sentence:
  head: “purchase”, path = /i/pred/i/mod/pcomp-n/rel/i
  phrase: “Google”, paths = {./s, ./subj}
  phrase: “which”, paths = {./obj}
  phrase: “YouTube”, paths = {../..}
  phrase: “for more than $1 billion”, paths = {./mod}
We get a (partially correct) role assignment:
  ARG0: “Google”, paths = {./s, ./subj}
  ARG1: “which”, paths = {./obj}
  TMP: “for more than $1 billion”, paths = {./mod}

Method 2
Evaluation results for Method 2:
  FrameNet   0.030
  PropBank   0.159
PropBank outperforms FrameNet because: PropBank has more lexical entries; PropBank has more example sentences per entry; and FrameNet does not annotate peripheral adjuncts.

Evaluation
A 21% improvement on the 264 non-’be’ TREC 2002 questions, when used on the Web:
  Method 1 – FrameNet         0.181
  Method 1 – PropBank         0.227
  Method 1 – VerbNet          0.223
  Method 1 – all resources    0.261
  Method 2 – PropBank         0.159
  All methods – PropBank      0.306
  All methods – all resources 0.367

Problem
Similar levels of improvement were not found when the exact same methods were applied directly to the AQUAINT corpus:
                        Web    AQUAINT
  Method 1 – FrameNet   0.181  0.027
  Method 2 – PropBank   0.159  0.023

Not an isolated case
Across 9 different IR models, [Iwayama et al, 2003] found similar differences when posing the same queries to a corpus of Japanese patent applications (full text) and a corpus of Japanese newspaper articles:
               patents  newspapers
  tf           .0227    .1054
  idf          .1577    .2443
  log(tf)      .1255    .2266
  log(tf).idf  .2132    .2853
  BM25         .2503    .3346
But they do not speculate on the reason for these results.

What makes for such differences?
In Kaisser’s case, the form in which information appears in the corpus may match neither the question nor any form derivable from it via FrameNet, PropBank or VerbNet.
  What year was Alaska purchased?
  On March 30, 1867, U.S. Secretary of State William H. Seward reached agreement with Russia to purchase the territory of Alaska for $7.2 million, a deal roundly ridiculed as Seward's Folly. (APW20000329.0213)
  But by 1867, when Secretary of State William H.
Seward negotiated the purchase of Alaska from the Russians, sweetheart deals like that weren't available anymore. (NYT19980915.0275)

Hypothesis
Profiling a corpus and adapting search to its characteristics can improve performance in IR and QA. This is neither new nor surprising: “Genre, like a range of other non-topical features of documents, has been under-exploited in IR algorithms to date, despite the fact that we know that searchers rely heavily on such features when evaluating and selecting documents” [Freund et al, 2006]. Also cf. [Argamon et al, 1998; Finn & Kushmerick, 2006; Karlgren, 2004; Kessler et al, 1997].

What basis for profiling?
Documents can be characterised in terms of genre, register and domain. These in turn implicate lexical choice, syntactic choice, choice of referring expression, structural choices at the document level, and formatting choices.

Definitions
Genre, register and domain are not completely independent concepts.
Genre: A distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form [Orlikowski & Yates, 1994].
Register: Generalized stylistic choices due to situational features such as audience and discourse environment [Morato et al., 2003].
Domain: The knowledge and assumptions held by members of a (professional) community.
Assumptions
In IR, it seems worth characterizing documents directly as to genre (and possibly register). Doing so automatically requires characterising, inter alia, significant linguistic features. For QA, further benefits will come from profiling the lexical, syntactic, referential, structural and formatting consequences of genre, register and domain, and exploiting these features directly.

Direct use of genre
[Freund et al, 2006] and [Yeung et al, 2007] analysed the behavior of software engineering consultants looking for documents they need in order to provide technical services to customers using the company’s software product. A range of genres was identified through both user interviews and analysis of the websites and repositories they used:
  Manuals
  Presentations
  Product documents
  Technotes, tips
  Tutorials and labs
  White papers
  Best practices
  Design patterns
  Discussions/forums
  ...
This requires manually labelling each document with its genre, or recognizing its genre automatically. The latter requires characterising genres in terms of automatically recognizable features. For example:
  Best practice: Description of a proven methodology or technique for achieving a desired result, often based on practical experience.
  Form: primarily text, many formats, variable length
  Style: imperatives, “best practice”
  Subject matter: new technologies, design, coding

Direct use of genre (X-Site)
A prototype workplace search tool for software engineers currently in use [Yeung et al, 2007]. Provides access to ~8GB of content crawled from the Internet, the intranet and Lotus Notes data. Exploits:
  task profiles
  task–genre associations: known +/- relationships between task and genre pairs
  an automatic genre classifier

Using genre, register, domain in QA
Answers to questions can be found anywhere, not just in documents on the specific topic.
  Q: When did the Titanic sink?
  Twelve months have passed since 193 people died aboard the Herald of Free Enterprise.
But time has not eased the pain of Evelyn Pinnells, who lost two daughters when the ferry capsized off Belgium. They were among the victims when the Herald of Free Enterprise capsized off the Belgian port of Zeebrugge on March 6, 1987. It was the worst peacetime disaster involving a British ship since the Titanic sank in 1912.

Using genre, register, domain in QA
For this reason, IR for QA differs from general IR, using (instead) passage retrieval, quoted strings, etc. For the same reason, one may not want to prefilter documents by genre, register or domain labels (as seems useful for IR). Rather, it may be beneficial to exploit features of and patterns in the linguistic features that realize genre, register and domain. What are those features?

Lexical features
Register strongly affects word choice:
  MedLinePlus: “runny nose”
  PubMed: “rhinitis”, “nasopharyngeal symptoms”
  Clinical notes: “greenish sputum”
  UMLS: the informal “greenish” doesn’t appear [Bodenreider & Pakhomov, 2004]
Domain also affects word choice: “smoltification” occurs ~600 times in a corpus of 1000 papers on salmon, but never in AQUAINT [Gabbay & Sutcliffe, 2004].

Lexical features
Register also strongly affects type/token ratios. There are only 850 core words (plus inflections) in Basic English, so its type/token ratio is very small. (Federalist papers: ~0.36.)
  King James Version: “And God said, Let the waters bring forth abundantly the moving creature that hath life, and fowl that may fly above the earth in the open firmament of heaven.”
  Bible in Basic English: “And God said, Let the waters be full of living things, and let birds be in flight over the earth under the arch of heaven.”
IR4QA using either keywords or quoted strings for passage retrieval could benefit from responding to both types of lexical divergence between question and corpus.
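The type/token ratio mentioned above is simple to compute. A minimal sketch follows; the tokenizer is a deliberately crude regex, and note that raw type/token ratios shrink as texts grow, so they are only comparable across samples of similar length.

```python
# Type/token ratio as a crude register measure: distinct word forms
# (types) divided by total word occurrences (tokens).
import re

def type_token_ratio(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens)

kjv = ("And God said, Let the waters bring forth abundantly the moving "
       "creature that hath life, and fowl that may fly above the earth "
       "in the open firmament of heaven.")
basic = ("And God said, Let the waters be full of living things, and let "
         "birds be in flight over the earth under the arch of heaven.")

# The Basic English rendering reuses its small core vocabulary more,
# so its ratio comes out lower than the King James Version's.
print(type_token_ratio(kjv) > type_token_ratio(basic))  # True
```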
Syntactic features: voice
Active:
  The Grid provides an ideal platform for new ontology tools and data bases, ...
  Users log-in using a password which is encrypted using a public key and private key mechanism.
Passive:
  Ontologies are recognized as having a key role in data integration on the computational Grid.
  We store ontology files in hierarchical collections, based on user unique identifiers, ontology identifiers, and ontology version numbers.

Syntactic features
Passive voice is used significantly more often in the physical sciences than in the social sciences [Bonzi, 1990]:
                      Active (%)  Passive (%)
  Air pollution       65.5        34.5
  Infectious disease  66.3        33.7
  Ed administration   77.9        22.1
  Social psychology   76.6        23.4
Passives are also used significantly often in surgical reports [Bross et al, 1972] and repair reports. For agentive verbs, the missing agent is the surgeon (or surgical team) or the repair person:
  “... the skin was prepared and draped ... Incision was made ... Axillary fat was dissected and bleeding controlled ...”
But not for non-agentive verbs:
  DURING NORMAL START CYCLE OF 1A GAS TURBINE, APPROX 90 SEC AFTER CLUTCH ENGAGEMENT, LOW LUBE OIL AND FAIL TO ENGAGE ALARM WERE RECEIVED ON THE ACC. (ALL CONDITIONS WERE NORMAL INITIALLY). SAC WAS REMOVED AND METAL CHUNKS FOUND IN OIL PAN. LUBE OIL PUMP WAS REMOVED AND WAS FOUND TO BE SEIZED. DRIVEN GEAR WAS SHEARED ON PUMP SHAFT. (CasRep: maintenance & repair report)

Syntactic features: clause type
  Main clause
  Relative clause: In this case, the user (j.bard) has created a private version of the CARO ontology which is shared with stuart.aitken ...
  Participial clause: Users log-in using a password which is encrypted using a public key and private key mechanism.
  Infinitive clause: ...
Relative clauses are used significantly more often in the social than the physical sciences, and participial clauses significantly more often in the physical than the social sciences [Bonzi, 1990]:
                      Rel clauses (%)  Participial clauses (%)
  Air pollution       20.5             29.6
  Infectious disease  25.2             28.1
  Ed administration   29.2             18.0
  Social psychology   33.0             18.6

Why might this matter?
Not all the arguments to verbs in different types of clauses are explicit, and the same techniques cannot be used to recover them. For relative clauses, syntax (attachment) suffices; for participial (and main) clauses, more general context must be assessed. The missing argument could be what answers the question.

Structural features
Document structure variation across genres is greater than variation within genres:
  the “inverted pyramid” structure of a news article
  the IMRD structure of a scientific article
  the step structure of instructions
  the ingredients list + step structure of recipes
  the SOAP structure of clinical records (and the systems structure within the Objective section)

Why might this matter?
Structure can suggest where to look for information. In news articles, information that defines terms is more likely to be found near the beginning than the end [Joho and Sanderson, 2000]. In scientific articles, position isn’t a good indicator for definitional material [Gabbay & Sutcliffe, 2004]. And if information isn’t in its intended section, one might conclude that it’s false, unnecessary, irrelevant, etc., depending on the function of the section:
  chocolate chips absent from the ingredients list in a recipe
  no mention of “irregular heart beat” in the report on the CV system in the Objective section of a SOAP note

Conclusion
The effectiveness of a given technique for IR, QA or IE can vary significantly, depending on the corpus it’s applied to. For search among documents (IR), genre and register are clear factors in user relevance decisions. For search within documents (QA, IE), they appear significant as well. In particular, search that is sensitive to genre and register may yield better performance than search that isn’t.

References
O Bodenreider, S Pakhomov (2003).
Exploring adjectival modification in biomedical discourse across two genres. ACL Workshop on Biomedical NLP, Sapporo, Japan.
S Argamon, M Koppel, G Avneri (1998). Routing documents according to style. First International Workshop on Innovative Information Systems, Boston, MA.
L Freund, C Clarke, E Toms (2006). Towards genre classification for IR in the workplace. First Symposium on Information Interaction in Context (IIiX).
I Gabbay, R Sutcliffe (2004). A qualitative comparison of scientific and journalistic texts from the perspective of extracting definitions. ACL Workshop on QA in Restricted Domains. Sydney.
M Iwayama, A Fujii, N Kando, Y Marukawa (2003). An empirical study on retrieval models for different document genres: Patents and newspaper articles. SIGIR’03, pp 251-258.
H Joho, M Sanderson (2000). Retrieving descriptive phrases from large amounts of free text. Proc. 9th Intl Conference on Information and Knowledge Management (CIKM), pp 180-186.
J Karlgren (1999). Stylistic experiments in information retrieval. In T. Strzalkowski (Ed.), Natural language information retrieval. Dordrecht, The Netherlands: Kluwer.
M Kaisser, B Webber (2007). Question answering based on semantic roles. ACL/EACL Workshop on Deep Linguistic Processing, Prague, CZ.
B Kessler, G Nunberg, H Schutze (1997). Automatic detection of text genre. Proc. 35th Annual Meeting, Association for Computational Linguistics, pp 32-38.
P Yeung, L Freund, C Clarke (2007). X-Site: A workplace search tool for software engineers. SIGIR’07 demo, Amsterdam.

Thank you!