AUTOMATIC IDENTIFICATION OF INDEX TERMS FOR INTERACTIVE BROWSING

Nina Wacholder, David K. Evans, Judith L. Klavans
nina@cs.columbia.edu, devans@cs.columbia.edu, klavans@cs.columbia.edu

ABSTRACT
Indexes -- structured lists of terms that provide access to document content -- have been around since before the invention of printing [31]. But most text content in digital libraries cannot be accessed through indexes: the time and expense of manually preparing standard indexes is prohibitive, and the quantity and variety of the data and the ambiguity of natural language make automatic generation of indexes a daunting task. Over the last decade, natural language processing techniques have improved enough to merit a reconsideration of automatic indexing. This paper addresses the problem of how to provide users of digital libraries with effective access to text content through automatically generated indexes consisting of terms identified by statistically and linguistically informed natural language processing techniques. We have developed LinkIT, a software tool for identifying significant topics in text [16]. LinkIT rapidly identifies noun phrases in full-length documents; each of these noun phrases is a candidate index term. These terms are then filtered and sorted using a significance metric called head sorting [39]. Wacholder et al. 2000 [41] show that these terms are perceived by users as useful index terms. We have also developed a prototype dynamic text browser, called Intell-Index, that supports user-driven navigation of index terms.
The focus of this paper is on the usefulness of the terms identified by LinkIT in dynamic text browsers. We analyze the suitability of these terms for use in an index by analyzing LinkIT's output over a 250 MB text corpus in terms of coverage, quality and consistency. We also show how the usefulness of these terms is enhanced by automatically bringing together subsets of related index terms. Finally, we describe Intell-Index, a prototype dynamic text browser for user-driven browsing and navigation of index terms.

Keywords
Indexing, phrases, natural language processing, browsing, genre.

1. OVERVIEW
Indexes are useful for information seekers because they:
- support browsing, a basic mode of human information seeking [32];
- provide information seekers with a valid list of terms, instead of requiring users to invent the terms on their own -- identifying index terms has been shown to be one of the hardest parts of the search process, e.g., [17];
- are organized in ways that bring related information together [31].
In spite of their well-established utility as an information access tool, expert indexes are not generally available for digital libraries. Human effort cannot keep up with the steady flow of content.
Indexes consisting of automatically identified terms have been legitimately criticized by information professionals such as Mulvany 1994 [31] on the grounds that they constitute indiscriminate lists, rather than synthesized and structured representations of content. ((results)) LinkIT implements a linguistically motivated technique for the recognition and grouping of simplex noun phrases (SNPs). LinkIT efficiently gathers minimal NPs, i.e. SNPs, and applies a refined set of post-processing rules to link these SNPs within a document. The identification of SNPs is performed using a finite state machine compiled from a regular expression grammar, and the ranking of candidate significant topics uses frequency information gathered in a single pass through the document. Our data consists of 250 megabytes of text taken from the Tipster 1994 CD-ROM [36]. The text comes from four sources: the Wall Street Journal, the Associated Press, the Federal Register and Ziff-Davis Publications. We consider several questions related to the usability of the terms extracted by LinkIT from this data in an automatically generated index, as measured both by the quality and by the number of index terms:
- Are automatically identified terms of sufficient quality and consistency to serve as useful access points to text content?
- What automatic techniques are available for organizing and sorting these terms in order to bring related terms together and eliminate terms not likely to be relevant for a particular information need?
- ((On average, how many useful index terms are identified for individual documents and corpora?))
The research approach that we take in this paper emphasizes fully automatic identification and organization of index terms that actually occur in the text. We have adopted this approach for three reasons:
1. Human indexers cannot keep up with the volume of text.
For publications like daily newspapers, it is especially important to rapidly create indexes for large amounts of text.
2. New names and terms are constantly being invented and/or published. For example, new companies are formed (e.g., Verizon Communications Inc.); people's names appear in the news for the first time (e.g., it is unlikely that Elian Gonzalez' name was in a newspaper before November 25, 1999); and new product names are constantly being invented (e.g., Handspring's Visor PDA). These terms frequently appear in print before they appear in reference sources. Therefore, indexing programs must be able to deal with new terms.
3. Techniques that rely on existing resources such as controlled vocabularies are usually not readily adapted to domains where these resources do not exist.
For these reasons, the question of how well state-of-the-art natural language processing techniques perform index term identification and organization without any manual intervention is of practical as well as theoretical importance. This paper also makes a contribution toward the development of methods for evaluating indexes and browsing applications. In addition, index terms are useful in applications such as information retrieval, document summarization and classification [45], [2]. The rest of this paper is organized as follows ((to add later)):

2. AUTOMATICALLY IDENTIFIED INDEX TERMS
The promise of automatic text processing for automatic generation of indexes has been recognized for several decades, e.g., Edmundson and Wyllys 1961 [13]. However, the quantity of text and the ambiguity of natural language have made progress on this task more difficult than was originally expected. For example, ((need nice ambiguous example)) In the 1990's, increased processing power and the decreasing cost of storage made it feasible to process large quantities of text in a reasonable period of time ((need example)).
At the same time, advances in natural language processing have led to new techniques for extracting information from text. Information identification is the automatic discovery of salient information -- concepts, events, and relationships -- in full-text documents. Stimulated in part by the US government sponsored MUC (Message Understanding Conference) competitions [11], [12], natural language processing techniques have been developed for identifying coherent grammatical phrases, proper names, acronyms and other information in text. The lists of terms identified by information extraction techniques are not identical to those compiled by human beings. They inevitably contain errors and inconsistencies that look downright foolish to human eyes. It is nevertheless possible to identify coherent lists with precision in the 90%s for certain kinds of terms typically included in indexes. For example, Wacholder et al. 1997 [40] and Bikel et al. 1997 [5] achieve these rates for proper names using rule-based and statistical techniques respectively. Cowie and Lehnert 1996 [7] observe that 90% precision in information extraction is probably satisfactory for everyday use of results. The most important development vis-a-vis identification of index terms has been the emergence of efficient, relatively accurate techniques for identifying phrases in text. Most index terms are noun phrases: coherent linguistic units whose head (most important word) is a noun. It is no accident that expert indexes and controlled vocabularies consist primarily of noun phrases, as a scan of almost any back-of-the-book index will show. The widespread availability of part-of-speech taggers -- systems that identify the grammatical category of each word -- makes it possible to identify lists of noun phrases and other meaningful units in documents (e.g., [8], [37], [1], [15]). Jacquemin et al. 1998 [23] have used derivational morphology to achieve greater coverage of multiword terms for indexing and retrieval.
In addition, there has been a great deal of research on identifying subgroups of noun phrases such as proper names [5], [33], [38], technical terminology [10], [24], [6], [22] and acronyms [28], [44]. These are the types of expressions that are traditionally included in indexes. There is also an emerging body of work on the development of interactive systems to support phrase browsing (e.g., Anick and Vaithyanathan 1997 [2], Gutwin et al. [19], Nevill-Manning et al. 1997 [32], Godby and Reighart 1998 [18]). This work shares with the work described here the goal of figuring out how best to use phrases to improve information access and how to structure lists of phrases identified in documents. But in spite of the long history of indexes as an information access tool, there has been relatively little research on indexing usability, an especially important topic vis-a-vis automatically generated indexes [20], [30]. Many issues still need to be explored: the difference between using a printed index and an electronic index, useful numbers of index terms for browsing, user issues, and so on. ((In the rest of this paper, we focus on noun phrases, and on techniques for automatically subclassifying, organizing and structuring them for presentation to users.))

3. AUTOMATICALLY IDENTIFIED NOUN PHRASES AS INDEX TERMS
In this section, we consider two issues related to the automatic identification of noun phrases for indexing and browsing applications:
1. What is the right type of noun phrase to use?
2. What subtypes of noun phrases are available for sorting indexes?
The first problem that concerns us is the optimal definition of noun phrase for indexing purposes. This is more than a theoretical question because noun phrases as they occur in natural language are recursive, that is, noun phrases occur within noun phrases. For example, the complex noun phrase a form of cancer-causing asbestos actually includes two simplex noun phrases, a form and cancer-causing asbestos.
A system that lists only complex noun phrases would list only one term, a system that lists both simplex and complex noun phrases would list all three phrases, and a system that identifies only simplex noun phrases would list two. A human indexer would choose whichever type is appropriate for each term, but natural language processing systems cannot yet do this reliably. Because of the ambiguity of natural language, it is much easier for such systems to identify the boundaries of simplex noun phrases than complex ones [39]. The decision to focus on simplex noun phrases rather than complex ones is therefore made for purely practical reasons.

4. MULTIPLE VIEWS OF INDEX TERMS
Index terms are organized in ways that represent either the content of a single document or the content of multiple documents. For example, Table 1 shows the terms in the order in which they occur in the Wall Street Journal article WSJ0003 (Penn Treebank). These index terms were extracted automatically by the LinkIT system as part of the process of identifying all noun phrases in a document (Evans 1998 [15]; Evans et al. 2000 [16]). The first column is the sentence number; the second column represents the token span. (This information will be suppressed below.)

Table 1: Topics, in document order, extracted from the first sentence of wsj0003
Sentence  Tokens  Noun Phrase
S1        1-2     A form
S1        4-4     asbestos
S1        9-11    Kent cigarette filters
S1        14-16   a high percentage
S1        18-19   cancer deaths
S1        21-22   a group
S1        24-24   workers
S1        30-31   30 years
S1        33-33   researchers

For most people, it is not difficult to guess that this list of terms has been extracted from a discussion about deaths from cancer in workers exposed to asbestos. One reason is that the list takes the form of noun phrases, a linguistically coherent unit that is meaningful to people. It is no coincidence that manually created indexes usually are composed of lists of noun phrases.
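The practical advantage of simplex noun phrases is how little machinery their recognition requires. The sketch below is not LinkIT's actual grammar (which is compiled into a finite state machine); it is a minimal illustration, assuming pre-tagged input with a simplified tag set, of how a (DT)? (JJ|NN)* NN pattern recovers the two SNPs in the example above.

```python
# (word, tag) pairs for "a form of cancer-causing asbestos";
# the tagging itself is assumed here for illustration.
tagged = [("a", "DT"), ("form", "NN"), ("of", "IN"),
          ("cancer-causing", "JJ"), ("asbestos", "NN")]

def simplex_nps(tagged):
    """Greedily collect maximal simplex NPs matching (DT)? (JJ|NN)* NN."""
    snps, i = [], 0
    while i < len(tagged):
        j = i + 1 if tagged[i][1] == "DT" else i   # optional determiner
        k = j
        while k < len(tagged) and tagged[k][1] in ("JJ", "NN"):
            k += 1                                  # absorb modifiers and nouns
        if k > j and tagged[k - 1][1] == "NN":      # chunk must end in a noun head
            snps.append(" ".join(w for w, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return snps

print(simplex_nps(tagged))   # ['a form', 'cancer-causing asbestos']
```

Because the pattern never recurses, the preposition of simply terminates the first chunk; no attachment decision is ever needed, which is exactly why SNP boundaries are easier to find than complex-NP boundaries.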
The information seeker is able to apply common sense and general knowledge of the world to interpret the terms and their possible relations to each other. At least for a short document, a complete in-order list of extracted terms can be browsed relatively easily to get a sense of the topics discussed in a single document. But in general such ordered lists have limited utility, and other methods for organizing phrases are needed. One way that the number of terms is reduced is by use of the Head Sorting method (Wacholder 1998 [39]), an automatic technique for approximating the identification of significant terms performed by humans in manual indexing. The candidate index terms in a document, i.e., the noun phrases identified by LinkIT, are sorted by head, the element that is linguistically recognized as semantically and syntactically the most important. The terms are then ranked by significance based on the frequency of the head in the document. After filtering based on significance ranking and other linguistic information, the following topics are identified as most important in WSJ0003 (Wall Street Journal 1988, available from the Penn Treebank; heads of terms are italicized).

Table 2: Most significant terms in document, based on head frequency
asbestos
workers
cancer-causing asbestos
cigarette filters
researcher(s)
asbestos fiber
crocidolite
paper factory

This list of phrases (which includes heads that occur above a frequency cutoff of 3 in this document, with content-bearing modifiers, if any) is a list of important concepts representative of the entire document. Another view of the phrases identified by LinkIT is obtained by linking noun phrases in a document that share the same head. A single-word noun phrase can be quite ambiguous, especially if its head is a frequently occurring noun like worker, state, or act.
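The core of head sorting can be sketched in a few lines. This is a simplification, not LinkIT's implementation: it assumes the head is simply the last word of each SNP, whereas LinkIT's head identification is linguistically informed.

```python
from collections import Counter

def significant_terms(snps, cutoff=3):
    """Rank distinct SNPs by document frequency of their head,
    keeping only those whose head occurs at least `cutoff` times."""
    head = lambda snp: snp.split()[-1].lower()
    freq = Counter(head(s) for s in snps)
    keep = [s for s in dict.fromkeys(snps)        # deduplicate, preserve order
            if freq[head(s)] >= cutoff]
    return sorted(keep, key=lambda s: -freq[head(s)])

# toy document: the head "workers" occurs three times, "group" once
snps = ["workers", "asbestos workers", "workers", "a group"]
print(significant_terms(snps))   # ['workers', 'asbestos workers']
```

Counting heads rather than whole phrases is what lets asbestos workers benefit from every occurrence of workers, so significant concepts surface even when no single phrasing repeats.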
Noun phrases grouped by head are likely to refer to the same concept, if not always to the same entity (Yarowsky 1993 [43]), and therefore can convey the primary sense of the head as used in the text. For example, in the sentence "Those workers got a pay raise but the other workers did not", the same sense of worker is used in both noun phrases even though two different sets of workers are referred to. Table 3 shows how the word workers is used as the head of a noun phrase in four different Wall Street Journal articles from the Penn Treebank; determiners such as a and some have been removed.

Table 3: Comparison of uses of worker as head of noun phrases across articles
wsj_0003: workers ... asbestos workers
wsj_0319: workers ... private sector workers ... private sector hospital workers ... nonunion workers ... private sector union workers
wsj_0592: workers ... private sector Steelworkers
wsj_0492: workers ... United workers ... United Auto Workers ... hourly production and maintenance workers

This view distinguishes the type of worker referred to in the different articles, thereby providing information that helps rule in certain articles as possibilities and eliminate others. This is because the list of complete uses of the head worker provides explicit positive and implicit negative evidence about the kinds of workers discussed in each article. For example, since the list for wsj_0003 includes only workers and asbestos workers, the user can infer that hospital workers or union workers are probably not referred to in this document.
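The cross-document view in Table 3 amounts to grouping each document's SNPs by head and projecting one head across documents. A sketch, again under the simplifying last-word-head assumption and with abbreviated toy document contents:

```python
def uses_of_head(doc_snps, target):
    """Map document id -> distinct SNPs whose head word is `target`."""
    return {doc: sorted({s for s in snps if s.split()[-1].lower() == target})
            for doc, snps in doc_snps.items()}

# abbreviated stand-ins for the SNP lists LinkIT would produce
docs = {"wsj_0003": ["workers", "asbestos workers", "a group"],
        "wsj_0319": ["workers", "nonunion workers", "private sector workers"]}

print(uses_of_head(docs, "workers"))
```

The per-document sets make the implicit negative evidence discussed above directly inspectable: a kind of worker absent from a document's set was (probably) never mentioned there.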
The three tables above show just a few of the ways that automatically identified terms are organized and filtered in our system. We have developed a prototype dynamic text browser, called Intell-Index, that allows users to interactively sort and browse the terms that we have identified. Figure 1 on p.8 shows the Intell-Index opening screen. The user first selects the collection to be browsed and then browses either the entire set of index terms identified by LinkIT or a subset. Figure 2 on p.8 shows the browsing screen, where the user may either click on a head, which gives a list of all of the terms with that head, or on a particular term, which gives only the specific expansion of the head (or only the head). Alternatively, the user may enter a search string and specify criteria that select a subset of terms to be returned. This gives the user better control over the search. For example, the user may specify whether or not the returned terms must match the case of the search string. This facility allows the user to view only proper names (with a capitalized last word), only common noun phrases, or both, and is especially useful for controlling the terms that the system returns. For example, specifying that the a in act be capitalized in a collection of social science or political articles is likely to return a list of laws with the word Act in their titles; this is much more specific than an indiscriminate search for the string act, regardless of capitalization. As the user browses the terms returned by Intell-Index, they may choose to view a list of the contexts in which a term is used; these contexts are sorted by document and ranked by normalized frequency in the document. This is a version of KWIC (keyword in context) that we call ITIC (index term in context). Finally, if the information seeker decides that the list of ITICs is promising, they may view the entire document.

5. QUALITY OF TERMS
The problem of how to identify terms that are good enough for browsing is a difficult one. Indexes are designed to serve diverse users and information needs, and by design they contain multiple terms, only a few of which will be of interest to a particular user. This contrasts with information retrieval systems, where the user typically specifies their information need through the query. Out of the context of an individual user's information need, it is hard to determine which terms should be included. The standard evaluation metrics, precision and recall, are not helpful here because ((…)).

We therefore assess the quality of our terms by two different measurements. First, we assess the quality of terms randomly extracted from our 257 MB corpus. 475 index terms (0.25% of the total) were randomly extracted from the corpus and alphabetized. Then we gave each term one of three ratings:

useful -- the term is arguably a coherent noun phrase and therefore makes sense as a distinct unit, even out of context. Examples of good terms are sudden current shifts, Governor Dukakis, and terminal-to-host connectivity.

useless -- the term is neither a noun phrase nor otherwise coherent. Examples of useless terms identified by the system are uncertainty is, x ix limit, and heated potato then shot. Most of these problems result from idiosyncratic or unexpected text formatting. Another source of problems is the part-of-speech tagger; for example, if it erroneously identifies a verb as a noun (as in the example uncertainty is), the resulting term is not a noun phrase.

intermediate -- any term that does not clearly belong in the useful or useless categories. Typically such terms consist of one or more good noun phrases along with some junk; they are enough like noun phrases that in some ways they fit the patterns of their component noun phrases. One example is up Microsoft Windows, which would be good if it did not include up -- the term arguably belongs in a list of NPs referring to Windows or Microsoft. Another is th newsroom, where th is presumably a typographical error for the; a third is Acceptance of responsibility 29, which would be fine if it did not include the 29. Many of these examples result from the difficulty of deciding where one proper name ends and the next one begins, as in General Electric Co. MUNICIPALS Forest Reserve District.

Table 4 shows the ratings by type of term and overall. The proper name category includes all terms that look like proper nouns, as identified by a simple regular expression based on capitalization of the head. The acronym category contains all terms that might be acronyms, using a regular expression that selects terms with two or more capitalized letters. While both of these heuristics are crude, they are easily applied to the list of terms and efficiently separate the terms into general classes. The common category contains all other terms. (For this study, we eliminated terms that started with non-alphabetic characters.) ((the types need to be defined above)) ((Also, I wrote some explanatory text with the tables that I gave you earlier about the utility of breaking the terms into classes, which might fit well here -- or a discussion of why we would want to do this.))

Table 4: Quality rating of terms, as measured by comprehensibility of terms
Rating    Common        Proper        Acronym      Total
Total     392           147           35           574
Useful    344 (87.8%)   105 (71.4%)   26 (74.3%)   475 (82.8%)
Moderate  18 (4.6%)     36 (24.5%)    8 (22.9%)    62 (10.9%)
Useless   30 (7.7%)     6 (4.1%)      1 (2.9%)     37 (6.4%)

The percentage of junk terms is well under 10%, which puts our results in the realm of being suitable for everyday use according to the metric of Cowie and Lehnert mentioned in Section 2. Another way to get at this question is to give users a task and an index and find out how easy or hard it is to find information. We leave this possibility to future research.

6. STATISTICAL ANALYSIS OF TERMS
In this section, we consider the problem of quantity of text and quantity of terms as metrics of index usability.
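The crude regular-expression heuristics used in Section 5 to separate terms into classes (a capitalized head for proper names, two or more capital letters for acronyms, a non-alphabetic first character for the residue) can be sketched as follows; this is an illustrative approximation, not the exact expressions used in the system.

```python
import re

ACRONYM = re.compile(r"[A-Z]{2,}")   # two or more capital letters in a row

def categorize(term):
    """Assign a term to one of the four rough classes used in our analysis."""
    if not term[0].isalpha():
        return "non-alphabetic"
    if ACRONYM.search(term):
        return "acronym"
    if term.split()[-1][0].isupper():   # capitalized head -> proper name
        return "proper"
    return "common"

for t in ["30 years", "IBM", "Governor Dukakis", "cigarette filters"]:
    print(t, "->", categorize(t))
# 30 years -> non-alphabetic
# IBM -> acronym
# Governor Dukakis -> proper
# cigarette filters -> common
```

The ordering of the tests matters: the acronym check runs before the proper-name check so that an all-capitals head is not misclassified as an ordinary proper noun.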
Thoroughness of coverage is one of the standard criteria for index evaluation [20]. However, thoroughness is a double-edged sword when it comes to computer-generated indexes. The ideal is thorough coverage of important terms and no coverage of unimportant terms, though of course the term important is problematic in this context because we assume that different users will consult the index for a variety of purposes and will have different levels of domain knowledge. We do not know of any research that evaluates user processing of lists; however, we assume that larger lists and more diverse lists are harder to process -- diverse lists because the context provides less information about how to interpret the list. ((example)) We also need to establish a uniform standard for the evaluation of browsing/indexing tools. Our test corpus consisted of 257 MB of text from Disks 1 and 2 of the Tipster CD-ROMs prepared for the TREC conferences (TREC 1994 [11]). We used documents from four collections -- Associated Press (AP), Federal Register (FR), Wall Street Journal 1990 and 1992 (WSJ) and Ziff-Davis (ZD). To allow for differences across corpora, we report overall statistics and per-corpus statistics as appropriate. However, our focus in this paper is on the overall statistics. The first question that we consider is how much the reduction of documents to noun phrases decreases the size of the data to be scanned.
For each full-text corpus, we created one parallel version consisting only of all occurrences of all noun phrases (duplicates not removed) in the corpus, and another parallel version consisting only of heads (duplicates not removed), as shown in Table 5.

Table 5: Corpus size. The numbers in parentheses are the number of words for the full-text column and the percentage of the full-text size for the non-unique NP and head columns.
Corpus  Full Text                       Non-unique NPs   Non-unique Heads
AP      12.27 MB (2.0 million words)    7.4 MB (60%)     5.7 MB (47%)
FR      33.88 MB (5.3 million words)    20.7 MB (61%)    13.7 MB (41%)
WSJ     45.59 MB (7.0 million words)    27.3 MB (60%)    18.2 MB (40%)
ZIFF    165.41 MB (26.3 million words)  108.8 MB (66%)   67.8 MB (41%)

The numbers of non-unique NPs and non-unique heads reflect the number of occurrences (tokens) of NPs and of heads of NPs. From the point of view of the index, however, this is only a first-level reduction in the number of candidate index terms: for browsing and indexing, each term need be listed only once. After duplicates have been removed, approximately 1% of the full text remains for heads, and 22% for noun phrases. ((Should we put this in the table? Or maybe we don't need this first table at all, but should just state this information.)) This suggests that we should use a hierarchical browsing strategy, using the shorter list of heads for initial browsing, and then using the more specific information in the SNPs when specification is requested. Interestingly, the percentages are relatively consistent across corpora.

Table 6 shows the relationship between document size in words and the number of noun phrases per document. For example, for the AP corpus, an average document of 476 words will typically have about 127 ((unique -- non-unique??)) NPs associated with it.

Table 6: SNPs per document
Corpus  Avg. Doc Size       Avg NPs/doc ((Avg NPs per word in document?))
AP      2.99K (476 words)   127
FR      7.70K (1175 words)  338
WSJ     3.23K (487 words)   132
ZIFF    2.96K (461 words)   129

Tolle and Chen 2000 [35] identified approximately 140 unique noun phrases per abstract for 10 medical abstracts. They do not report the average length in words of the abstracts, but a reasonable guess is probably about 250 words per abstract. In contrast, LinkIT identifies about 130 SNPs for documents of just under 500 words. This is a much more manageable number, and we get the advantage of having an index for the full text, which by definition includes more information than an abstract. ((length issues -- not considering here what happens to index terms as text gets longer, e.g., 20,000 word articles or _____ word book)) ((Dave, do unique SNPs and heads maintain capitalization distinctions? What about plurals?))

Table 7 shows another property of natural language: the number of unique noun phrases increases much faster than the number of unique heads. This can be seen in the fall in the ratio of unique heads to SNPs as the corpus size increases.

Table 7: Number of unique SNPs and heads
Corpus  Unique NPs  Unique Heads  Ratio of Unique Heads to NPs
AP      156798      38232         24%
FR      281931      56555         20%
WSJ     510194      77168         15%
ZIFF    1731940     176639        10%
Total   2490958     254724       10%

Table 7 is interesting for a number of reasons: 1) the variation in the ratio of heads to SNPs per corpus -- this may well reflect the diversity of AP and the FR relative to the WSJ and especially Ziff; 2) as one would expect, the ratio of heads to the total is smaller for the total than for the average of the individual corpora. This is because the heads are nouns. (No dictionary can list all nouns; the set of nouns is constantly growing, but at a slower rate than the possible number of noun phrases.)

In general, the vast majority of heads have two or fewer different possible expansions. There is a small number of heads, however, that have a large number of expansions. For these heads, we could create a hierarchical index that is only displayed when the user requests further information on the particular head. In the data that we examined, the heads had on average about 6.5 expansions, with a standard deviation of 47.3. ((In this table, we've lost a number for the Ziff data. Also, is it really correct that the max for Ziff is 15877?? Is this true?))

Table 8: Average number of head expansions per corpus
Corpus  Max    % <= 2  2 < n < 50 (%)  % >= 50  Avg   Std. Dev.
AP      557    72.2%   26.6%           1.2%     4.3   13.63
FR      1303   76.9%   21.3%           1.8%     5.5   26.95
WSJ     5343   69.9%   27.8%           2.3%     7.0   46.65
ZIFF    15877  75.9%   21.6%           2.5%     10.5  102.38

Additionally, these terms have not been filtered; we may be able to greatly narrow the search space if the user can provide further information about the type of terms they are interested in. For example, using simple regular expressions, we are able to roughly categorize the terms that we have found into four categories: regular SNPs, SNPs that look like proper nouns, SNPs that look like acronyms, and SNPs that start with non-alphabetic characters. It would be possible to narrow the index to one of these categories, or to exclude some of them from the index.

Table 9: Number of SNPs by category
Corpus  # of SNPs  # of Proper Nouns  # of Acronyms  # of non-alphabetic elements
AP      156798     20787 (13.2%)      2526 (1.61%)   12238 (7.8%)
FR      281931     22194 (7.8%)       5082 (1.80%)   44992 (15.95%)
WSJ     510194     44035 (8.6%)       6295 (1.23%)   63686 (12.48%)
ZIFF    1731940    102615 (5.9%)      38460 (2.22%)  193340 (11.16%)
Total   2490958    189631 (7.6%)      45966 (1.84%)  300373 (12.06%)

Over all of the corpora, about 10% of the SNPs start with a non-alphabetic character; these can be excluded if the user is searching for a general term. If we know that the user is searching specifically for a person, then we can use the list of proper nouns as index terms, further narrowing the search space (to approximately 10% of the possible terms).

7. CONCLUSION/DISCUSSION

8. ACKNOWLEDGMENTS
This work has been supported under NSF Grant IRI-97-12069, "Automatic Identification of Significant Topics in Domain Independent Full Text Analysis", PIs: Judith L. Klavans and Nina Wacholder, and NSF Grant CDA-97-53054, "Computationally Tractable Methods for Document Analysis", PI: Nina Wacholder.

9. REFERENCES
[1] Aberdeen, J., J. Burger, D. Day, L. Hirschman, and M. Vilain (1995) "Description of the Alembic system used for MUC-6", Proceedings of MUC-6, Morgan Kaufmann. Also, Alembic Workbench, http://www.mitre.org/resources/centers/advanced_info/g04h/workbench.html.
[2] Anick, Peter and Shivakumar Vaithyanathan (1997) "Exploiting clustering and phrases for context-based information retrieval", Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '97), pp.314-323.
[3] Baeza-Yates, Ricardo and Berthier Ribeiro-Neto (1999) Modern Information Retrieval, ACM Press, New York.
[4] Bagga, Amit and Breck Baldwin (1998) "Entity-based cross-document coreferencing using the vector space model", Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, pp.79-85.
[5] Bikel, D., S. Miller, R. Schwartz, and R. Weischedel (1997) "Nymble: a high-performance learning name-finder", Proceedings of the Fifth Conference on Applied Natural Language Processing, 1997.
[6] Boguraev, Branimir and Kennedy, Christopher (1998) "Applications of term identification terminology: domain description and content characterisation", Natural Language Engineering 1(1):1-28.
[7] Cowie, Jim and Wendy Lehnert (1996) "Information extraction", Communications of the ACM, 39(1):80-91.
[8] Church, Kenneth W. (1988) "A stochastic parts program and noun phrase parser for unrestricted text", Proceedings of the Second Conference on Applied Natural Language Processing, pp.136-143.
[9] Dagan, Ido and Ken Church (1994) "Termight: identifying and translating technical terminology", Proceedings of ANLP '94, Applied Natural Language Processing Conference, Association for Computational Linguistics, 1994.
[10] Damereau, Fred J. (1993) "Generating and evaluating domain-oriented multi-word terms from texts", Information Processing and Management 29(4):433-447.
[11] DARPA (1998) Proceedings of the Seventh Message Understanding Conference (MUC-7). Morgan Kaufmann, 1998.
[12] DARPA (1995) Proceedings of the Sixth Message Understanding Conference (MUC-6). Morgan Kaufmann, 1995.
[13] Edmundson, H.P. and Wyllys, W. (1961) "Automatic abstracting and indexing -- survey and recommendations", Communications of the ACM, 4:226-234.
[14] Evans, David A. and Chengxiang Zhai (1996) "Noun-phrase analysis in unrestricted text for information retrieval", Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp.17-24, 24-27 June 1996, University of California, Santa Cruz, California, Morgan Kaufmann Publishers.
[15] Evans, David K. (1998) LinkIT Documentation, Columbia University Department of Computer Science Report. Available at <http://www.cs.columbia.edu/~devans/papers/LinkITTechDoc/>.
[16] Evans, David K., Klavans, Judith, and Wacholder, Nina (2000) "Document processing with LinkIT", Proceedings of the RIAO Conference, Paris, France.
[17] Furnas, George, Thomas K. Landauer, Louis Gomez and Susan Dumais (1987) "The vocabulary problem in human-system communication", Communications of the ACM 30:964-971.
[18] Godby, Carol Jean and Ray Reighart (1998) "Using machine-readable text as a source of novel vocabulary to update the Dewey Decimal Classification", presented at the SIG-CR Workshop, ASIS, <http://orc.rsch.oclc.org:5061/papers/sigcr98.html>.
[19] Gutwin, Carl, Gordon Paynter, Ian Witten, Craig Nevill-Manning and Eibe Frank (1999) "Improving browsing in digital libraries with keyphrase indexes", Decision Support Systems 27(1-2):81-104.
[20] Hert, Carol A., Elin K. Jacob and Patrick Dawson (2000) "A usability assessment of online indexing structures in the networked environment", Journal of the American Society for Information Science 51(11):971-988.
[21] Hatzivassiloglou, Vasileios, Luis Gravano, and Ankineedu Maganti (2000) "An investigation of linguistic features and clustering algorithms for topical document clustering", Proceedings of SIGIR '00, pp.224-231, Athens, Greece, 2000.
[22] Hodges, Julia, Shiyun Yie, Ray Reighart and Lois Boggess (1996) "An automated system that assists in the generation of document indexes", Natural Language Engineering 2(2):137-160.
[23] Jacquemin, Christian, Judith L. Klavans and Evelyne Tzoukermann (1997) "Expansion of multi-word terms for indexing and retrieval using morphology and syntax", Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, (E)ACL '97, Barcelona, Spain, July 12, 1997.
[24] Justeson, John S. and Slava M. Katz (1995) "Technical terminology: some linguistic properties and an algorithm for identification in text", Natural Language Engineering 1(1):9-27.
Large Corpora, Association for Computational Linguistics, June 22, 1993.
[25] Klavans, Judith, Nina Wacholder and David K. Evans (2000) "Evaluation of computational linguistic techniques for identifying significant topics for browsing applications", Proceedings of LREC, Athens, Greece.
[26] Klavans, Judith and Philip Resnik (1996) The Balancing Act, MIT Press, Cambridge, Mass.
[38] Wacholder, Nina, Yael Ravin and Misook Choi (1997) "Disambiguating proper names in text", Proceedings of the Applied Natural Language Processing Conference, March, 1997.
[39] Wacholder, Nina (1998) "Simplex noun phrases clustered by
[27] Klavans, Judith, Martin Chodorow and Nina Wacholder (1990) “From dictionary to text via taxonomy”, Electronic Text Research, University of Waterllo, Centre for the New OED and Text Research, Waterloo, Canada. [28] Larkey, Leah S., Paul Ogilvie, M. Andrew Price, Brenden Tamilio (2000) Acrophile: An Automated Acronym Extractor and Server In Digital Libraries 'Proceedings of the Fifth ACM Conference on Digital Libraries, pp. 205-214, San Antonio, TX, June, 2000. [29] Lawrence, Steve, C. Lee Giles and Kurt Bollacker (1999) “Digital libraries and autonomous citation indexing”, IEEE Computer 32(6):67-71. [30] Milstead, Jessica L. (1994) “Needs for research in indexing”, Journal of the American Society for Information Science. [31] Mulvany, Nancy (1993) Indexing Books, University of Chicago Press, Chicago, IL. [32] Nevill-Manning, Craig G., Ian H. Witten and Gordon W. Paynter (1997) “Browsing in digital libraries: a phrase based approach”, Proceedings of the DL97, Association of Computing Machinery Digital Libraries Conference, 230236. [33] Paik, Woojin, Elizabeth D. Liddy, Edmund Yu, and Mary McKenna (1996) “Categorizing and standardizing proper names for efficient information retrieval”,. In Boguraev and Pustejovsky, editors, Corpus Processing for Lexical Acquisition, MIT Press, Cambridge, MA. [34] Wall Street Journal (1988) Available from Penn Treebank, head: a method for identifying significant topics in a document”, Proceedings of Workshop on the Computational Treatment of Nominals, edited by Federica Busa, Inderjeet Mani and Patrick Saint-Dizier, pp.70-79. COLING-ACL, October 16, 1998, Montreal. [40] Wacholder, Nina, Yael Ravin and Misook Choi (1997) “Disambiguation of proper names in text”, Proceedings of the ANLP, ACL, Washington, DC., pp. 202-208. [41] Wacholder, Nina, David Kirk Evans, Judith L. 
Klavans (2000) “Evaluation of automatically identified index terms for browsing electronic documents”, Proceedings of the Applied Natural Language Processing and North American Chapter of the Association for Computational Linguistics (ANLP-NAACL) 2000. Seattle, Washington, pp. 302-307. [42] Wright, Lawrence W., Holly K. Grossetta Nardini, Alan Aronson and Thomas C. Rindflesch (1999) “Hierarchical concept indexing of full-text documents in the Unified Medical Language System Information Sources Map”. (()) [43] Yarowsky, David (1993) “One sense per collocation”, Proceedings of the ARPA Human Language Technology Workshop, Princeton, pp 266-271. [44] Yeates, Stuart. “Automatic extraction of acronyms from text”, Proceedings of the Third New Zealand Computer Science Research Students' Conference, pp.117-124, Stuart Yeates, editor. Hamilton, New Zealand, April 1999. [45] Zhou, Joe (1999) “Phrasal terms in real-world applications”. In Natural Language Information Retrieval, edited by Tomek Strazalowski, Kluwer Academic Publishers, Boston, pp.215-259. Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA. [35] Tolle, Kristin M. and Hsinchun Chen (2000) “Comparing noun phrasing techniques for use with medical digital library tools”, Journal of the American Society of Information Science 51(4):352-370. [36] Text Research Collection (TREC) Volume 2 (revised March 1994), available from the Linguistic Data Consortium. 
[37] Voutilainen, Atro (1993) “Noun phrasetool, a detector of English noun phrases”, Proceedings of Workshop on Very Figure 1 Intell-Index opening screen < http://www.cs.columbia.edu/~nina/IntellIndex/indexer.cgi> Figure 2: Browse term results Browse Term Results 6675 terms match your query ability political ability ABM abuses accommodation political accommodation accomplishment significant accomplishment human rights abuses Accord Trilateral Accord Accords Background De-Nuclearization Accords acceptance widespread acceptance broad acceptance accord access full access U.S. access accession quick accession accessions earlier accessions 12/19/2000 subsequent post-Soviet accord bilateral accord bilateral nuclear cooperation accord accords nuclear-weapon-free-zone accords accounting ... Wacholder 34
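The grouping of related terms under a shared head, as displayed in the browse-term results of Figure 2, can be sketched as follows. Taking the final word of a simplex NP as its head is a simplifying assumption of ours; the head-sorting method of [39] is more refined.

```python
from collections import defaultdict

def group_by_head(terms):
    """Cluster simplex noun phrases under a shared head noun.
    Simplifying assumption: in an English simplex NP the head is the
    final word; head sorting [39] is more nuanced than this."""
    index = defaultdict(list)
    for term in terms:
        index[term.split()[-1].lower()].append(term)
    # Alphabetize the heads and, within each head, the expanded terms
    return {head: sorted(group, key=str.lower)
            for head, group in sorted(index.items())}

terms = ["accord", "bilateral accord", "political ability",
         "ability", "bilateral nuclear cooperation accord"]
idx = group_by_head(terms)
# idx["ability"] == ["ability", "political ability"]
```

A browser built on such an index can show the bare head first and let the user expand it to the longer, more specific phrases, as Intell-Index does.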