Language and Information
LIS 610
November 6, 2002
Nina Wacholder
nina@scils.rutgers.edu

Agenda
  Role of language in information science
  Current research: Human Computer Interaction with Electronic Indexes and Index Terms

Textual information
  Information conveyed by alphabets, digits and punctuation
  Organized into meaningful units recognized by some group of people

Other techniques for conveying information
  Spoken language
  Gesture
  Facial expression
  Sound
  Images (drawings, photographs …)

Language
  Uniquely human
  Learned
  Conventional

Understanding language is hard
  Expresses complex concepts
  Ambiguity: words, phrases and sentences have more than one meaning
  Synonymy: different words, phrases and sentences have the same meaning

Complex concepts
  Pencil
  Face
  Directions to Alexander Library
  Theory of relativity
  U.S. election law

Synonymy
  child, kid, adolescent, baby
  flammable, inflammable
  I was walking up the street that day.
  I was walking down the street that day.
  Moxie wrote that report.
  That report was written by Moxie.

Ambiguity: semantic
  Bat
  Make a bed
  Moxie ate potatoes with a fork.
  Moxie ate potatoes with fish.

Ambiguity: structural (syntactic)
  Red airplane terminal
    • [[red airplane] terminal]
    • [red [airplane terminal]]
  Moxie saw Toxie in the park with a telescope
    • Moxie saw [Toxie in the park with a telescope]
    • Moxie [saw] Toxie in the park [with a telescope]

Natural language processing (NLP)
  Natural language
  Computer language

The NLP controversy: rules vs. statistics

NLP by rule
  Lexicon (vocabulary)
    Det: a
    ProperName: Moxie
    Noun: report
    Verb: wrote
  Syntactic rules
    NounPhrase[a report] → Det[a] Noun[report]
    NounPhrase[Moxie] → ProperName[Moxie]
    VerbPhrase[wrote a report] → Verb[wrote] NounPhrase[a report]
    Sentence[Moxie wrote a report] → NounPhrase[Moxie] VerbPhrase[wrote a report]

NLP by statistics
  Luhn (1958)
  tf*idf (Salton and Buckley 1988)
  Maximum entropy (Berger, Della Pietra and Della Pietra 1996)

Information-access tasks with significant natural language component
  Information retrieval
  Information extraction
  Automatic summarization
  Question answering

Sparck Jones (2001)
  Task core vs. task context
  Information retrieval: 30-40% accuracy for systems in natural environment
  Information extraction: 50% for core systems
  Automatic summarization: no sound basis for core evaluation

Evaluation of Head Sorting Mechanism
  Wacholder, Klavans and Evans (2000)
  Task
    Compare domain-independent, corpus-independent methods for automatic identification of terms to represent a document or collection of documents
  Methods for term identification
    Head-sorted NPs (HS) (Wacholder 1998)
    Keywords (KW)
    Technical Terms (TT) (Justeson and Katz 1995)

Examples of terms identified by indexing method
  Keywords: asbestos/asbestosis, worker/workers/worked, cancer, death, make, lorillard, u.s., fiber, dr., …
  Head-sorted NPs: workers, asbestos workers, 160 workers, cancer, lung cancer, asbestos, cancer causing asbestos, lung cancer deaths, …
  Technical terms: cancer deaths, lung cancer, kent cigarette, dr. talcott, cigarette filter
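
The head-sorted NP examples above group phrases that share a head ("workers", "asbestos workers", "160 workers"). The slides do not spell out the procedure, so the following is only a minimal sketch of the idea. It assumes that the head of a simple English noun phrase is its last word, and that phrases arrive already extracted; the function names head_of and head_sort are hypothetical and are not taken from Wacholder (1998).

```python
from collections import defaultdict


def head_of(noun_phrase):
    """Return the assumed head of a simple NP: its last word, lowercased."""
    return noun_phrase.split()[-1].lower()


def head_sort(noun_phrases):
    """Group noun phrases by head; heads are listed alphabetically.

    Within a group, shorter phrases come first, then alphabetical order.
    """
    groups = defaultdict(list)
    for phrase in noun_phrases:
        groups[head_of(phrase)].append(phrase)
    return {
        head: sorted(groups[head], key=lambda p: (len(p.split()), p))
        for head in sorted(groups)
    }


# Phrases taken from the head-sorted NP examples in the slides.
phrases = [
    "asbestos workers",
    "lung cancer",
    "workers",
    "cancer",
    "160 workers",
    "cancer causing asbestos",
    "asbestos",
    "lung cancer deaths",
]

index = head_sort(phrases)
for head, grouped in index.items():
    print(head, "->", grouped)
```

Grouping by head places related phrases next to each other in a browsable list, which is the property the evaluation compares against keywords and technical terms.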
[Figure: Ranking of terms by cumulative percentage. Vertical axis: cumulative percentage, 0 to 1. Horizontal axis: rating, 0.5 to 5. Series: KWD, TT, SNP.]

Ranking by cumulative number of terms
  1 = best; 5 = worst
  Number of terms ranked at or better than each rating, by method

  Ranked at or better than    KW     HS     TT
  2                           27     41     15
  3                           75     96     21
  4                          124    132     21
  5                          166    160     21

Summary of results
  Head-sorted terms
    mixed quality terms
    good document coverage
  Technical terms
    high quality terms
    poor document coverage
  Keywords
    low quality terms
    good document coverage

ISATC Pilot Project
  Nina Wacholder, PI
  PhD Students: Lu Liu, Mark Sharp, Peng Song, Xiaojun Yuan

Research question
  Null hypothesis: Properties of index terms do not affect information seeker’s selection of terms
  What properties of index terms affect the selection of terms?
  What effects do these properties have?

Material
  Text
    Rice, McCreadie and Chang (2001)
  Index terms
    Head sorted terms (Wacholder 1998)
    Technical terms (Justeson and Katz 1995)
    Human index terms

Experimental Searching and Browsing Interface (ESBI)
  http://www.scils.rutgers.edu/cgi-bin/indexer.cg

Initial results

Future work
  Further analysis of experimental data
    Compare subjects by type (e.g., undergraduate, MLIS)
    Effectiveness of searches (i.e., did they get the right answer)
    Overlap of words in index terms with words in question
    …
  Evaluation of ESBI interface
  Comparison of additional techniques for identifying terms
  Use of different texts
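
One of the planned analyses is the overlap of words in index terms with words in the question. The slides do not define the measure, so the sketch below is only one possible reading: the fraction of an index term's words that also occur in the question. The tokenizer, the function names words and overlap, and the sample question and terms are all hypothetical, invented for illustration, and are not experimental data.

```python
import re


def words(text):
    """Return the set of lowercased word tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(index_term, question):
    """Fraction of the index term's words that also occur in the question."""
    term_words = words(index_term)
    if not term_words:
        return 0.0
    return len(term_words & words(question)) / len(term_words)


# Hypothetical question and index terms, invented for illustration only.
question = "What kinds of barriers limit access to information?"
terms = ["information access", "information seeking", "communication media"]

scores = {term: overlap(term, question) for term in terms}
for term, score in scores.items():
    print(term, score)
```

Normalizing by the length of the index term keeps long and short terms comparable; a symmetric measure such as Jaccard similarity would be an equally plausible choice.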