Lecture03 Vocabularies.ppt

Vocabulary & languages in
indexing & searching
Connection:
indexing
searching
tefkos@rutgers.edu; http://comminfo.rutgers.edu/~tefko/
© Tefko Saracevic
Central idea
Indexing and searching: inexorably connected
– you cannot search what was not first indexed
in some manner or other
• to be searched everything is and must be indexed
somehow even if it is not called “indexed”
– indexing of documents or objects is done in order
to be searchable
• there are a great many ways to do indexing
– to index one needs an indexing language
• there are a great many indexing languages
– even taking every word in a document is an indexing language
Knowing searching is knowing indexing
ToC
1. Definitions
2. Controlled & uncontrolled vocabularies
3. Inverted indexes
4. Thesaurus
1. Definitions
A few concepts from general to specific
Defined concepts
valid for application in indexing & searching
General
– language
– vocabulary
Specific
– index terms
– indexing vocabulary
– indexing language
– descriptors
– keywords
– search terms
– search vocabulary
– query language
General definitions
[Encarta Dictionary]
Language
1. communication with
words: the human use of
spoken or written words as
a communication system
2. system of
communication: a system
of communication with its
own set of conventions or
special words
Vocabulary
1. words of language: all
the words used in a
language as a whole
2. words of subject
area: the set of words
associated with a subject or
area of activity, or used by
an individual person
Specific definitions
Starting from the most basic concept:
Index term:
A word or phrase that denotes (describes) a concept
& connotes (implies) a class
index term “table” describes a table
and implies many kinds of tables,
for which, if desired, we may have more specific index terms
More definitions ...
Indexing vocabulary
a set of index terms used in a domain or for a set of
documents or objects
• it could be even a single document or object e.g. a book
Indexing language
an indexing vocabulary together with rules – syntax,
grammar – for their application and use
Variation on Index term
Descriptor
Word or phrase used to identify a topic or idea. Part
of a controlled vocabulary, normally listed in a
thesaurus (defined later). May be used as a search term.
Keyword
A significant word from a text of a record which can
be used as a search term in a free-text search to
retrieve all the records containing it
– Could be assigned manually, but now done mostly
automatically – key entry in automatic indexing
Searching definitions
Question
request by a user related to user’s information need,
task, problem at hand
Question analysis
breakdown & elaboration of concepts in a question
to be translated into search terms
Query
question or part thereof as stated for searching
according to rules of a given system
more ...
Search term
a counterpart to index term, also denoting a
concept and connoting a class for a search
Search vocabulary
a set of search terms in a domain or available in a
system
Query language
a search vocabulary together with rules for their use
in searching
elaboration …
• Question is what user
asks and what you may
then have elaborated
• Query is what is asked
of computer to match –
what is put in for
searching
• Question is transformed
into query
• Example: Question:
– What are some major
historical developments
in the area of
information retrieval?
• Transformed into query
– history information retrieval (in Google)
– history AND information(w)retrieval (in Dialog), plus you have
to select which file(s) to search
more …
“An index language is the language used to
describe documents and requests.
The elements of the index language are index
terms, which may be derived from the text of
the document to be described, or may be
arrived at independently.
The vocabulary of an index language may be
controlled or uncontrolled.”
(van Rijsbergen, 1979)
2. Controlled & uncontrolled
vocabularies
Approaches, tensions
Controlled vocabulary
• Predetermined – indicating what terms are to be
used in indexing
– may show definition of and relations between
terms
• examples: thesaurus, subject heading list, classification
• Also indicates terms that may be selected for
searching
• An indexing AND a searching tool
• Human constructed
– and costly to construct and use
Example of controlled vocabularies
Medical Subject Headings
(MeSH) of the National
Library of Medicine
• One of the largest &
most comprehensive
– used in indexing &
searching
• More than 22,000
descriptors, with more
than 106,000 cross-references
• More than 139,000
Supplementary Concept
Records
• Approximately 50
publication types (Journal
Article, News, Editorial, Review,
Randomized Controlled Trial, etc)
• Done by indexers
• But also experimenting
with semi-automatic
indexing
Uncontrolled vocabulary
• Derived from texts – natural language - in
documents
– nowadays automatically
• using various ways or algorithms
– constantly tested: which algorithm is better?
• Used to construct inverted indexes
• In turn, inverted indexes are used for free text
searching
Comparison of vocabularies
Controlled
• The idea of a controlled
vocabulary is to reduce the
variability of expressions
used to characterize
documents being indexed &
searched for
• Manual, costly, time-consuming;
also semi-automatic in some systems
• Dynamic – needs constant
changing, updating
Uncontrolled or free
• The idea is to follow natural
language expressions as
they occur in documents
• Could be automatic
– great advantage
– algorithms constantly
changing & improving
• e.g. parsing phrases,
connections
• Prevailing in many
applications
Controlled vs. free text searching
• Endless source of debate &
controversy
• But, each has its place for given
circumstance & retrieval goal
• Each has strengths & weaknesses
• can you list or find a list comparing them? – this
is a good search assignment
• Users mostly use free text searching
• Professional searchers use both as
warranted – have to know when
• Professional credo:
KNOW THY CONTROLLED
VOCABULARY so you can apply it in
searching as/or when needed
3. Inverted indexes
Use in searching
Inverted indexes & searching
Useful to know how they function to understand
search & retrieval. Steps:
1. Each document is indexed
– every word in a document is taken as index term
with exception of stop words, if any
– position in text is noted, even for stop words
2. Indexes for all documents are merged
• index terms are arranged alphabetically in the bowels
of the system, so they can be searched
• under each index term are document numbers in which it
appears & position in text for that document
So, when you search
for digital AND libraries:
1. computer takes all documents under digital
2. and all documents under libraries
3. compares to “see” which documents have both terms and
then
4. provides you the list of those documents that contain both
terms, no matter where in the document
• This is also called “coordinate indexing”
– coordination is done at time of searching
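The steps above amount to intersecting two lists of document numbers. A minimal Python sketch, assuming a toy index whose document numbers are made up for illustration (the name search_and is hypothetical, not a command of any system):

```python
# Toy inverted index: term -> set of document numbers (hypothetical data).
index = {
    "digital": {1, 2, 5, 8},
    "libraries": {2, 3, 8, 9},
}

def search_and(index, term_a, term_b):
    """Return documents containing both terms, wherever they occur."""
    docs_a = index.get(term_a, set())   # step 1: all documents under term_a
    docs_b = index.get(term_b, set())   # step 2: all documents under term_b
    return sorted(docs_a & docs_b)      # steps 3-4: keep documents that have both

print(search_and(index, "digital", "libraries"))  # [2, 8]
```

Nothing about the pairing is stored in advance; the two terms are coordinated only when the query arrives.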
Variation: when you search
for digital (WITH) libraries or
“digital libraries”, i.e., as a phrase
1. computer goes through the same steps as before but then
also
2. “looks” for documents where digital is positioned right
before libraries
• remember: computer “knows” position of each
term in each document, each sentence
• So searching for a phrase is a form of searching of terms
connected with AND but in a given sequence
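The phrase variation can be sketched as an AND followed by a position check. This is an illustrative sketch only: the positional index below is made up, and search_adjacent is a hypothetical name.

```python
# Toy positional index: term -> {document number: [word positions]} (hypothetical data).
positions = {
    "digital": {1: [4], 2: [1, 9], 8: [3]},
    "libraries": {2: [2], 3: [5], 8: [7]},
}

def search_adjacent(positions, first, second):
    """Return documents where `second` occurs immediately after `first`."""
    first_docs = positions.get(first, {})
    second_docs = positions.get(second, {})
    hits = []
    for doc in sorted(first_docs.keys() & second_docs.keys()):  # the AND step
        following = set(second_docs[doc])
        # the extra step: is `second` found at some position p + 1?
        if any(p + 1 in following for p in first_docs[doc]):
            hits.append(doc)
    return hits

print(search_adjacent(positions, "digital", "libraries"))  # [2]
```

Document 8 contains both words and would satisfy AND, but the words are not next to each other, so the phrase search drops it.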
Example of searches in inverted file

Doc #   Text
1       Slow brown truck arrived
2       Shipment of brownies damaged in a fire
3       Delivery of brownies arrived in a slow truck
4       Shipment of brownies arrived in a truck

For simplicity documents have one sentence.
Stop words: “a” “of” “in” – but their position counted

Inverted index
Term        (Doc number:position)
arrived     (1:4), (3:4), (4:4)
brown       (1:2)
brownies    (2:3), (3:3), (4:3)
damaged     (2:4)
delivery    (3:1)
fire        (2:7)
shipment    (2:1), (4:1)
slow        (1:1), (3:7)
truck       (1:3), (3:8), (4:7)

Search for slow AND truck gets as results documents 1 and 3 since
both contain slow and truck
Search for slow (w) truck retrieves only document 3, in which slow is
7th and truck is 8th, so they are right next to each other. Doc 1 has
both words, but not next to each other, thus it is not retrieved
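The example can be reproduced with a short Python sketch. It assumes simple whitespace tokenization and lower-casing, which the slide does not specify, and the function names are hypothetical.

```python
STOP_WORDS = {"a", "of", "in"}

DOCS = {
    1: "Slow brown truck arrived",
    2: "Shipment of brownies damaged in a fire",
    3: "Delivery of brownies arrived in a slow truck",
    4: "Shipment of brownies arrived in a truck",
}

def build_index(docs):
    """Map each non-stop word to a list of (doc number, position) pairs."""
    index = {}
    for doc_no, text in docs.items():
        # positions start at 1 and are counted for stop words too
        for pos, word in enumerate(text.lower().split(), start=1):
            if word not in STOP_WORDS:
                index.setdefault(word, []).append((doc_no, pos))
    return dict(sorted(index.items()))  # terms in alphabetical order

def search_and(index, a, b):
    """Documents that contain both terms, in any position."""
    docs_a = {d for d, _ in index.get(a, [])}
    docs_b = {d for d, _ in index.get(b, [])}
    return sorted(docs_a & docs_b)

def search_w(index, a, b):
    """Adjacency search: b directly after a in the same document."""
    after_a = {(d, p + 1) for d, p in index.get(a, [])}
    return sorted({d for d, p in index.get(b, []) if (d, p) in after_a})

index = build_index(DOCS)
for term, postings in index.items():
    print(term, postings)
print(search_and(index, "slow", "truck"))  # [1, 3]
print(search_w(index, "slow", "truck"))    # [3]
```

The printed postings match the inverted index table, and the two searches return the same documents as stated on the slide.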
Everything is inverted
- consequences for searching
• All words in all fields are
inverted, no matter if
– in title, full text, descriptor,
author …
• Thus all are searchable
• In some systems (but not
all) phrases are parsed &
thus searchable
– but in most systems phrases are
searched as A(w)B, or “A B”
• But beware:
– search for libraries as
descriptor
• e.g. libraries/DE in Dialog
– will retrieve ALL other
descriptors where libraries
appear in addition to
descriptor libraries itself
• e.g. academic libraries, public
libraries, special libraries,
research libraries …
– but there are search tricks to
avoid that
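Why a descriptor search for libraries pulls in other descriptors can be seen in a small Python sketch. The records are made up, and the second index (matching a descriptor as one whole unit) is only an assumption about how such a restriction could work, not a description of Dialog.

```python
# Hypothetical records: record number -> descriptors assigned to it.
records = {
    1: ["Libraries"],
    2: ["Academic Libraries", "Library Services"],
    3: ["Public Libraries"],
    4: ["Information Retrieval"],
}

def invert_by_word(records):
    """Index every word of every descriptor separately."""
    index = {}
    for rec_no, descriptors in records.items():
        for descriptor in descriptors:
            for word in descriptor.lower().split():
                index.setdefault(word, set()).add(rec_no)
    return index

def invert_by_descriptor(records):
    """Index each descriptor as one whole unit."""
    index = {}
    for rec_no, descriptors in records.items():
        for descriptor in descriptors:
            index.setdefault(descriptor.lower(), set()).add(rec_no)
    return index

word_index = invert_by_word(records)
whole_index = invert_by_descriptor(records)
print(sorted(word_index["libraries"]))   # [1, 2, 3]
print(sorted(whole_index["libraries"]))  # [1]
```

With word-by-word inversion, records indexed under Academic Libraries and Public Libraries are retrieved along with the record indexed under Libraries alone.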
4. Thesaurus
A major tool for controlled vocabularies
in information retrieval (IR)
What is a thesaurus?
“For writers, it is a tool like Roget’s one with words
grouped and classified to help select the best word
to convey a specific nuance of meaning.
For indexers and searchers, it is an information storage
and retrieval tool: a listing of words and phrases
authorized for use in an indexing system, together
with relationships, variants and synonyms, and aids
to navigation through the thesaurus.”
(Milstead, 2000)
more…
“A thesaurus to an information scientist is a
controlled set of the terms used to index
information in a database, and therefore also
to search for information in that database so
the same concepts are represented by the
same term.”
(Batty, 1998)
Thesaurus
• Good old Peter Mark Roget had a most useful
idea in the 1850s (his Thesaurus was first published in 1852) & did a great job
• Following this idea thesaurus became THE major
tool for controlled vocabulary in IR
– starting in the 1950s & to this day a great many IR thesauri
have been developed for all kinds of subjects
• including, for instance, in information science
– all have a similar structure & function
– but they are difficult & costly to construct & maintain
Standards, software
• Subject to international standards:
– “Guidelines for the Construction, Format, and
Management of Monolingual Controlled Vocabularies”
ANSI/NISO Standard Z39.19
– followed by “Construction of Controlled Vocabularies. A
Primer”
• A number of software products are available for
thesaurus construction and maintenance
– e.g. as listed by American Society for Indexing
Examples of thesauri
• Thesauri have been constructed for a great many
domains, from A to Z
– here are some lists
• international & multilingual thesauri
• online thesauri
• among them ERIC Thesaurus (we use it for example)
– BUT: different thesauri may and do treat the same
descriptor (index term) differently
• having different, more or fewer narrower, broader, related
terms
• thus it is dangerous to use them interchangeably
Basic thesaurus components
• For each entry thesaurus has a classification grid:
– Descriptor (DE) – an index term that has
• Scope note (SN) – context in which used
• Broader terms (BT) – higher in a hierarchy
• Narrower terms (NT) – lower in a hierarchy
• Related terms (RT) – other connected descriptors
• Used for (UF) – synonyms that are not descriptors
– Note: not all of these may be present for every descriptor
• A searcher or indexer can use these as a guide for
selection/rejection & for browsing to get ideas
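The components listed above map naturally onto a small record type. A minimal Python sketch follows; the class, the function, and the sample entry are all hypothetical, and the entry is made up rather than copied from ERIC or any real thesaurus.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ThesaurusEntry:
    descriptor: str                                      # DE
    scope_note: str = ""                                 # SN
    broader: List[str] = field(default_factory=list)     # BT
    narrower: List[str] = field(default_factory=list)    # NT
    related: List[str] = field(default_factory=list)     # RT
    used_for: List[str] = field(default_factory=list)    # UF

def preferred_term(entries: List[ThesaurusEntry], term: str) -> Optional[str]:
    """Return the descriptor for `term`, following UF references; None if unknown."""
    wanted = term.lower()
    for entry in entries:
        if wanted == entry.descriptor.lower():
            return entry.descriptor
        if wanted in (uf.lower() for uf in entry.used_for):
            return entry.descriptor
    return None

# Made-up entry for illustration only.
entries = [
    ThesaurusEntry(
        descriptor="Libraries",
        scope_note="Hypothetical scope note",
        broader=["Information Sources"],
        narrower=["Academic Libraries", "Public Libraries"],
        related=["Library Services"],
        used_for=["Library Facilities"],
    )
]

print(preferred_term(entries, "library facilities"))  # Libraries
print(preferred_term(entries, "bookshops"))           # None
```

Every field except the descriptor defaults to empty, reflecting the note that not all components are present for every descriptor.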
Standard structure
With variations on the theme, thesauri have similar
conceptual structure to guide searcher or indexer:
Broader terms - BT
Related terms - RT
Descriptor - DE
Used for - UF
Synonyms
Scope note - SN
Narrower terms - NT
Note: Every descriptor doesn't have to have all of these
Same thesaurus but …
• Examples of ERIC (Educational Resources Information
Center) thesaurus as used differently in different systems:
1. ERIC own system
2. ERIC file on Dialog (begin 1)
3. ERIC file on OVID (accessible through RUL)
• Notice how each uses the same ERIC thesaurus but displays &
searches it in its own way; the principles are still the same
• Oh well…
ERIC online thesaurus on ERIC
• Allows for
– searching for words that are included in
descriptors by category or all categories
– browsing alphabetically
– browsing in one of about 40 categories
• Search for libraries in all categories found 50
descriptors that have “library” included
• Out of these, the descriptor libraries was selected
ERIC online thesaurus on ERIC
descriptor libraries
Descriptor
Other descriptors – one could browse
ERIC thesaurus on Dialog
• In a convoluted way ERIC thesaurus (and other ones)
can be displayed on Dialog (and other vendors, such
as OVID)
• How?
– begin in file 1 – ERIC
– then expand a desired term – here we used term library
– you will see under R that certain terms have related terms
– meaning that these are thesaurus entries
– then expand on one of those to see related terms
– then you can browse & choose which ones to use in search
• And here are printed screens of the process
Note on command expand (E) in Dialog
• Dialog (and some other
systems) has a neat way
to display all entries in
any inverted index
alphabetically
– command is Expand or e
– it could be done in any of
the indexes – basic and
additional
For instance:
e library will provide alpha
list of term library in basic
index & then after
expanding again you can
see related terms (see next)
e Au=Saracevic will provide
alpha list of all entries in
the author additional index
around that name
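The idea behind Expand, showing the alphabetical neighbourhood of a term in an inverted index, can be imitated with a binary search over a sorted term list. This is a sketch of the idea only, not Dialog's implementation; the term list, the function name expand, and its parameters are hypothetical.

```python
import bisect

# Hypothetical alphabetical term list from an inverted index.
terms = sorted([
    "librarians", "libraries", "library", "library administration",
    "library services", "libretto", "license", "licensing",
])

def expand(terms, term, before=2, after=3):
    """Return the alphabetical neighbourhood of `term` in a sorted term list."""
    i = bisect.bisect_left(terms, term)   # where `term` is, or would be inserted
    return terms[max(0, i - before): i + after]

print(expand(terms, "library"))
```

Because the lookup finds the insertion point, it also works for a term that is not in the index, which is handy for checking spelling variants.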
going
Expand
library
going …
RT indicates
related terms
46865 items
have
library
This one has
14 related
terms
going …
We now chose descriptor LIBRARY ADMINISTRATION and expand on that
one
Neat trick:
You can expand on expand
& get related terms out of ERIC thesaurus
These are now
R terms of
various types
going …
14 related
terms for this
one are listed
Can expand on
this one to see
other RT
You can also
select any of
these to search
going …
We have now selected r15 – library services to
search for documents
going …
And this is
the no. of
items we got
Now we can view some items in a chosen format
or we can further modify this search - add, refine, …
gone
This is one
of the items
we got
Descriptors
used for
this item
Additional
index terms
Start ERIC search on OVID (accessed through RUL)
Automatically gets you to thesaurus
This one was selected to enlarge
Allows you to select thesaurus (or not)
This one was selected to enlarge
Then go to ERIC thesaurus on OVID (accessed through RUL)
gone
• Next go and select additional terms
• Or search for libraries only
• See no. of results
• Select fields and formats by making a check
• and happy going …
• suggestion: repeat this exercise
Point being that the same thesaurus is
handled differently by different databases
Relevance feedback – an important search tactic
• Method for using information in items judged
relevant to further refine or change the search
– first you find a relevant document (or documents)
– in relevant document(s) you browse titles,
descriptors, identifiers, abstracts … to get leads
(e.g. keywords) for further search terms & tactics
– then you search for those
• in some advanced systems this may be done
automatically
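One simple way to automate the browsing step is to count the descriptors attached to the documents judged relevant and offer the most frequent unused ones as leads. The sketch below assumes exactly that; the records and the function suggest_terms are hypothetical, and real systems may weigh terms quite differently.

```python
from collections import Counter

# Hypothetical records judged relevant by the searcher, with their descriptors.
relevant_records = [
    {"title": "Doc A", "descriptors": ["Library Services", "Information Retrieval", "Users"]},
    {"title": "Doc B", "descriptors": ["Information Retrieval", "Online Searching"]},
    {"title": "Doc C", "descriptors": ["Information Retrieval", "Library Services"]},
]

def suggest_terms(relevant_records, already_used, top_n=3):
    """Rank descriptors from relevant records that the query has not used yet."""
    used = {t.lower() for t in already_used}
    counts = Counter(
        d for rec in relevant_records for d in rec["descriptors"]
        if d.lower() not in used
    )
    # most frequent first; ties broken alphabetically so the output is stable
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]

print(suggest_terms(relevant_records, already_used=["information retrieval"]))
# ['Library Services', 'Online Searching', 'Users']
```

The searcher still decides which of the suggested terms to use in the next search.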
Query expansion –
another important search tactic
• Method for adding, modifying, changing search
terms in a query
– to broaden, narrow, focus, change … terms
• Many sources can be used
– relevance feedback, thesauri, dictionaries,
textbooks, documents, catalogs, & people: users,
colleagues, your own mind & experience
• Some systems suggest terms for query
expansion
Query expansion tactics
• You can use the same structure for expanding
query terms as in a thesaurus
– think of what may be broader, narrower, related terms or
synonyms to use as search terms
Broader terms - BT
Related terms - RT
Query term
Synonyms
Narrower terms - NT
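Using the BT / NT / RT / synonym structure for expansion can be sketched in a few lines of Python. The thesaurus fragment is made up, expand_query is a hypothetical helper, and the OR-with-quotes output is a generic notation rather than the syntax of any particular system.

```python
# Made-up thesaurus fragment: term -> its BT / NT / RT / UF lists (illustration only).
thesaurus = {
    "libraries": {
        "BT": ["information sources"],
        "NT": ["academic libraries", "public libraries"],
        "RT": ["library services"],
        "UF": ["library facilities"],
    }
}

def expand_query(term, thesaurus, relations=("NT", "UF")):
    """OR together a query term with terms drawn from the chosen relations."""
    entry = thesaurus.get(term.lower(), {})
    terms = [term]
    for relation in relations:
        for candidate in entry.get(relation, []):
            if candidate not in terms:
                terms.append(candidate)
    return " OR ".join(f'"{t}"' for t in terms)

print(expand_query("libraries", thesaurus))
print(expand_query("libraries", thesaurus, relations=("BT",)))
```

Choosing which relations to include is the tactic itself: narrower terms and synonyms pick up documents indexed more specifically or worded differently, while broader terms widen the subject.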
Conclusion
• At the base of all searching are
– terms
– vocabularies
– languages
– but a variety exists
• In reality in searching there is no
completely controlled or
uncontrolled vocabulary
– matter of degree
– & most importantly, matter of
mastery
symbolically;
controlled & free vocabulary
thank you!