Wacholder_LangandInfo.ppt

Language and Information
LIS 610
November 6, 2002
Nina Wacholder
nina@scils.rutgers.edu
Agenda
 Role of language in information science
 Current research: Human-Computer Interaction with Electronic Indexes and Index Terms
Language and Information 11/06/02 Nina Wacholder
Textual information
 Information conveyed by letters, digits and punctuation
 Organized into meaningful units recognized by some group of people
Other techniques for conveying information
 Spoken language
 Gesture
 Facial expression
 Sound
 Images (drawings, photographs …)
Language
 Uniquely human
 Learned
 Conventional
Understanding language is hard
 Expresses complex concepts
 Ambiguity – words, phrases and sentences have more than one meaning
 Synonymy – different words, phrases and sentences have the same meaning
Complex concepts
 Pencil
 Face
 Directions to Alexander Library
 Theory of relativity
 U.S. election law
Synonymy
 child, kid, adolescent, baby
 flammable, inflammable
 I was walking up the street that day.
 I was walking down the street that day.
 Moxie wrote that report.
 That report was written by Moxie.
Ambiguity: semantic
 Bat
 Make a bed
 Moxie ate potatoes with a fork.
 Moxie ate potatoes with fish.
Ambiguity: structural (syntactic)
 Red airplane terminal
• [[red airplane] terminal]
• [red [airplane terminal]]
 Moxie saw Toxie in the park with a telescope
• Moxie saw [Toxie in the park with a telescope]
• Moxie [saw] Toxie in the park [with a telescope]
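The two readings of "red airplane terminal" are the two ways of bracketing a three-word sequence into binary groups. As a rough illustration only, the sketch below enumerates every binary bracketing of a word list. The function name `bracketings` is made up for this example, and the sketch does nothing to decide which reading is intended.

```python
def bracketings(words):
    """Return every binary bracketing of a word sequence as a string."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Split the sequence at every position and combine each bracketing
    # of the left part with each bracketing of the right part.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append(f"[{left} {right}]")
    return results


for reading in bracketings(["red", "airplane", "terminal"]):
    print(reading)
```

The number of bracketings grows quickly with the length of the phrase, which is one reason structural ambiguity is hard to resolve automatically.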
Natural language processing (NLP)
 Natural language
 Computer language
The NLP controversy: rules vs. statistics
NLP by rule
 Lexicon (vocabulary)
 Det: a
 ProperName: Moxie
 Noun: report
 Verb: wrote
 Syntactic rules
 NounPhrase[a report] → Det[a] Noun[report]
 NounPhrase[Moxie] → ProperName[Moxie]
 VerbPhrase[wrote a report] → Verb[wrote] NounPhrase[a report]
 Sentence[Moxie wrote a report] → NounPhrase[Moxie] VerbPhrase[wrote a report]
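The lexicon and syntactic rules on this slide are enough to drive a very small parser. The sketch below is a minimal top-down parser over exactly those four lexical entries and four rules. The names `LEXICON`, `RULES`, `parse`, `expand` and `parse_sentence` are assumptions made for the example, and the approach only works for grammars without left-recursive rules.

```python
# Lexicon and rules copied from the slide; everything else is illustrative.
LEXICON = {"a": "Det", "Moxie": "ProperName", "report": "Noun", "wrote": "Verb"}

RULES = [
    ("NounPhrase", ["Det", "Noun"]),
    ("NounPhrase", ["ProperName"]),
    ("VerbPhrase", ["Verb", "NounPhrase"]),
    ("Sentence", ["NounPhrase", "VerbPhrase"]),
]


def parse(category, words, start):
    """Yield (tree, end) for each way `category` can cover words[start:end]."""
    # Lexical case: the next word is listed under this category.
    if start < len(words) and LEXICON.get(words[start]) == category:
        yield f"{category}[{words[start]}]", start + 1
    # Phrasal case: try every rule that builds this category.
    for lhs, rhs in RULES:
        if lhs == category:
            yield from expand(lhs, rhs, words, start, [])


def expand(lhs, rhs, words, pos, children):
    """Match the remaining right-hand-side categories starting at `pos`."""
    if not rhs:
        yield f"{lhs}[{' '.join(children)}]", pos
        return
    for subtree, end in parse(rhs[0], words, pos):
        yield from expand(lhs, rhs[1:], words, end, children + [subtree])


def parse_sentence(text):
    """Return all Sentence trees that cover the whole input."""
    words = text.split()
    return [tree for tree, end in parse("Sentence", words, 0) if end == len(words)]


print(parse_sentence("Moxie wrote a report"))
print(parse_sentence("Moxie report wrote"))  # no rule covers this word order
```

A word order the rules do not license, such as the second call, yields an empty list. This shows both the precision of rule-based NLP and its brittleness when the input falls outside the grammar.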
NLP by statistics
 Luhn (1958)
 tf*idf (Salton and Buckley 1988)
 Maximum entropy (Berger, Della Pietra and Della Pietra 1996)
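To make the statistical approach concrete, here is a minimal sketch of one common tf*idf variant: raw term frequency multiplied by log(N/df). This is only one of the weighting schemes in the literature and is not necessarily the one the lecture has in mind. The function name `tf_idf`, the whitespace tokenization and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Toy documents, invented for illustration.
docs = ["lung cancer deaths", "lung cancer", "cigarette filter"]
for doc, scores in zip(docs, tf_idf(docs)):
    print(doc, scores)
```

A term that occurs in every document gets weight 0 under this variant, while a term confined to one document gets the highest weight. No grammar or lexicon is involved, only counts.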
Information-access tasks with significant natural language component
 Information retrieval
 Information extraction
 Automatic summarization
 Question answering
Sparck Jones (2001)
 Task core vs. task context
 Information retrieval: 30-40% accuracy for systems in a natural environment
 Information extraction: 50% for core systems
 Automatic summarization: no sound basis for core evaluation
Evaluation of Head Sorting Mechanism
Wacholder, Klavans and Evans (2000)
 Task
 Compare domain-independent, corpus-independent methods for automatic identification of terms to represent a document or collection of documents
 Methods for term identification
 Head-sorted NPs (HS) (Wacholder 1998)
 Keywords (KW)
 Technical Terms (TT) (Justeson and Katz 1995)
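As a rough illustration of what "head-sorted" means, the sketch below groups noun phrases under their head and sorts the groups alphabetically. It assumes the head of a simple English noun phrase is its last word, which is a simplification and not a description of the method in Wacholder (1998). The function name `head_sort` is made up for the example, and the sample phrases are taken from the examples slide in this deck.

```python
from collections import defaultdict


def head_sort(noun_phrases):
    """Group noun phrases under their head, assumed here to be the last word."""
    by_head = defaultdict(list)
    for phrase in noun_phrases:
        head = phrase.split()[-1]  # simplifying assumption: head = last word
        by_head[head].append(phrase)
    # Sort heads alphabetically, and phrases alphabetically under each head.
    return {head: sorted(phrases) for head, phrases in sorted(by_head.items())}


phrases = [
    "asbestos workers",
    "lung cancer",
    "160 workers",
    "workers",
    "cancer",
    "cancer causing asbestos",
]
for head, group in head_sort(phrases).items():
    print(head, "->", group)
```

Under this grouping, "workers", "asbestos workers" and "160 workers" appear together under the head "workers". A plain alphabetical list of the same phrases would scatter them.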
Examples of terms identified by indexing method

Keywords                Head-sorted NPs            Technical terms
asbestos/asbestosis     workers                    cancer deaths
worker/workers/worked   asbestos workers           lung cancer
cancer                  160 workers                kent cigarette
death                   cancer                     dr. talcott
make                    lung cancer                cigarette filter
lorillard               asbestos                   u.s.
fiber                   cancer causing asbestos
dr.                     lung cancer deaths
…                       ...
Ranking of terms by cumulative percentage
[Figure: cumulative percentage of terms (y-axis, 0 to 1) against rating (x-axis, 0.5 to 5) for KWD, TT and SNP]
Ranking by cumulative number of terms
1 = best; 5 = worst

Number of terms ranked at or better than each rating, by method

Rating   KW    HS    TT
2        27    41    15
3        75    96    21
4        124   132   21
5        166   160   21
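The cumulative counts on this slide can be turned into cumulative shares, which makes the three methods easier to compare. The sketch below assumes that the count at rating 5 (the worst rating) is the total number of rated terms for each method, and that the flattened numbers map to KW, HS and TT in that order. The names `cumulative` and `cumulative_share` are made up for the example.

```python
# Cumulative counts as read from the slide: terms rated at or better than
# each rating (1 = best, 5 = worst), per method.
cumulative = {
    "KW": {2: 27, 3: 75, 4: 124, 5: 166},
    "HS": {2: 41, 3: 96, 4: 132, 5: 160},
    "TT": {2: 15, 3: 21, 4: 21, 5: 21},
}


def cumulative_share(counts):
    """Convert cumulative counts to fractions of the method's total."""
    total = counts[max(counts)]  # assumption: count at the worst rating = total
    return {rating: n / total for rating, n in counts.items()}


for method, counts in cumulative.items():
    shares = cumulative_share(counts)
    print(method, {rating: round(share, 2) for rating, share in shares.items()})
```

Under these assumptions, about 16% of keywords, 26% of head-sorted terms and 71% of technical terms are rated 2 or better. Technical terms account for only 21 terms in total, against 166 and 160 for the other two methods.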
Summary of results
Head-sorted terms
 mixed quality terms
 good document coverage
Technical terms
 high quality terms
 poor document coverage
Keywords
 low quality terms
 good document coverage
ISATC Pilot Project
 Nina Wacholder, PI
 PhD Students: Lu Liu, Mark Sharp, Peng Song, Xiaojun Yuan
Research question
 Null hypothesis: Properties of index terms do not affect an information seeker’s selection of terms
 What properties of index terms affect the selection of terms?
 What effects do these properties have?
Material
 Text
 Rice, McCreadie and Chang (2001)
 Index terms
 Head-sorted terms (Wacholder 1998)
 Technical terms (Justeson and Katz 1995)
 Human index terms
Experimental Searching and Browsing Interface (ESBI)
http://www.scils.rutgers.edu/cgi-bin/indexer.cg
Initial results
Future work
 Further analysis of experimental data
 Compare subjects by type (e.g., undergraduate, MLIS)
 Effectiveness of searches (i.e., did they get the right answer?)
 Overlap of words in index terms with words in question
…
 Evaluation of ESBI interface
 Comparison of additional techniques for identifying terms
 Use of different texts