Vocabulary & languages in searching
Connection: indexing – searching
© Tefko Saracevic
Basic assertion
Indexing and searching:
inexorably connected
– you cannot search for what was not
first indexed in some manner or other
– indexing of documents or objects is
done so that they can be searched
• there are many ways to do indexing
– to index one needs an indexing language
• there are many indexing languages
– even taking every word in a document is an
indexing language
Knowing searching is knowing indexing
General definitions
Vocabulary [Encarta Dictionary]
“1. words known
LANGUAGE - all the words used by or known to a
particular person or group, or contained in a
language as a whole”
Language
“1. speech of group
the speech of a country, region, or group of
people, including its diction, syntax, and
grammar
2. system of communication
a system of communication with its own set of
conventions or special words”
From general to specific
• These general definitions are valid
for application in indexing &
searching to define
– index terms
– indexing vocabulary
– indexing language
– search terms
– search vocabulary
– query (request, search) language
Specific
Index term
a word or phrase that denotes (describes)
a concept & connotes (implies) a class
e.g. index term “table” describes a table [image: a table]
and implies many kinds of tables [images: several kinds of tables],
for which, if desired, we may have more specific index terms
Specific ...
Indexing vocabulary
a set of index terms used in a domain or
for a set of documents or objects
• it could be even a single document or object
e.g. a book
Indexing language
an indexing vocabulary together with
rules – syntax, grammar – for its
application and use
Specific ...
Search terms
a counterpart to index terms, also
denoting a concept and connoting a
class for a search
Search vocabulary
a set of search terms in a domain or
available in a system
Query language
a search vocabulary together with rules
for its use in searching
More
“An index language is the language
used to describe documents and
requests.
The elements of the index language
are index terms, which may be
derived from the text of the
document to be described, or may be
arrived at independently.
The vocabulary of an index language
may be controlled or uncontrolled.”
(van Rijsbergen, 1979)
Controlled vocabulary
• Predetermined – indicating which
terms are to be used in indexing
– may show definition of and relations
between terms
• examples: thesaurus, subject heading list,
classification
• Also indicates terms that may be
selected for searching
• An indexing AND a searching tool
• Human constructed
– and costly to construct and use
Uncontrolled vocabulary
• Derived from documents
– nowadays automatically
• using various ways or algorithms
– constant issue: which way is “better”
• Used to construct inverted indexes
• a concordance, such as of the Bible,
indicating place and position of each word
mentioned in the text is an inverted index
• monks used to do it in 12th century,
computers do it today
• Inverted indexes are used for free
text searching
Controlled vs. free text searching
• Endless source of debate &
controversy
• But each has its place for a
given circumstance &
retrieval goal
• Each has strengths &
weaknesses
• can you list or find a list
comparing them?
• Users mostly use free text
searching
• Professional searchers use
both as warranted
• As option:
KNOW THY CONTROLLED
VOCABULARY
Inverted indexes
Useful to know how they function to
understand search & retrieval. Steps:
1. Each document is indexed
– every word in a document is taken as
index term with exception of stop words
– position in text is noted
2. Indexes for all documents are merged
• index terms are arranged alphabetically
in the bowels of the system
• under each index term are document numbers in
which it appears & position in text for that
document
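The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular retrieval system is implemented: the function names are made up for this example, the documents and stop words are the ones from the example slide later in this deck, and positions are counted with stop words included, as that slide does.

```python
from collections import defaultdict

STOP_WORDS = {"a", "of", "in"}  # stop list taken from the example slide

def index_document(text):
    """Step 1: take every non-stop word as an index term and note its position."""
    positions = defaultdict(list)
    # Positions start at 1 and count every word, stop words included.
    for position, word in enumerate(text.lower().split(), start=1):
        if word not in STOP_WORDS:
            positions[word].append(position)
    return positions

def merge_indexes(per_document):
    """Step 2: merge per-document indexes into one alphabetical inverted index."""
    merged = defaultdict(list)
    for doc_no in sorted(per_document):
        for term, positions in per_document[doc_no].items():
            merged[term].extend((doc_no, p) for p in positions)
    return dict(sorted(merged.items()))

docs = {
    1: "Slow brown truck arrived",
    2: "Shipment of brownies damaged in a fire",
    3: "Delivery of brownies arrived in a slow truck",
    4: "Shipment of brownies arrived in a truck",
}
index = merge_indexes({n: index_document(t) for n, t in docs.items()})
print(index["slow"])   # [(1, 1), (3, 7)]
print(index["truck"])  # [(1, 3), (3, 8), (4, 7)]
```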
So, when you search
for digital AND libraries:
1. computer takes all documents under
digital
2. and all documents under libraries
3. compares to “see” which documents have
both terms and then
4. provides you the list of those documents
in a default format or you may choose a
format
• This is also called “coordinate indexing”
– coordination is done at the time of searching
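A minimal Python sketch of these steps, assuming a small hand-written inverted index (the postings for slow, truck and arrived are copied from the example slide later in this deck; the function names are invented for illustration). The AND is simply the intersection of the two sets of document numbers, computed at search time.

```python
# term -> list of (document number, position in document)
index = {
    "arrived": [(1, 4), (3, 4), (4, 4)],
    "slow": [(1, 1), (3, 7)],
    "truck": [(1, 3), (3, 8), (4, 7)],
}

def docs_for(term):
    """Steps 1 and 2: all document numbers listed under one term."""
    return {doc_no for doc_no, _position in index.get(term.lower(), [])}

def search_and(*terms):
    """Step 3: keep only the documents that appear under every term."""
    doc_sets = [docs_for(term) for term in terms]
    if not doc_sets:
        return []
    return sorted(set.intersection(*doc_sets))

# Step 4: the resulting list of documents is what gets displayed.
print(search_and("slow", "truck"))  # [1, 3]
```

A term that is not in the index has an empty document set, so any AND that includes it returns nothing.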
Variation: when you search
for digital (WITH) libraries or
“digital libraries”, i.e. as a phrase
1. computer goes through the same steps as
before but then also
2. “looks” for documents where digital is
positioned right before libraries
• remember: computer “knows” position of
each term in each document, each
sentence
• So searching for a phrase is a form of
searching of terms connected with AND
but in a given sequence
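The extra positional check can be sketched as follows. This is a simplified two-word illustration with invented function names, using postings copied from the example slide later in this deck; it is not a description of how DIALOG or any other system implements the (W) operator.

```python
# term -> list of (document number, position in document)
index = {
    "slow": [(1, 1), (3, 7)],
    "truck": [(1, 3), (3, 8), (4, 7)],
}

def search_phrase(first, second):
    """Documents in which `first` is positioned right before `second`."""
    first_postings = set(index.get(first.lower(), []))
    hits = set()
    for doc_no, position in index.get(second.lower(), []):
        # Same document, and `first` sits exactly one position earlier.
        if (doc_no, position - 1) in first_postings:
            hits.add(doc_no)
    return sorted(hits)

print(search_phrase("slow", "truck"))  # [3]
print(search_phrase("truck", "slow"))  # []
```

The second call returns nothing because the order matters: truck never comes right before slow.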
Example of inverted file
For simplicity documents have one sentence.
Stop words: “a,” “of,” “in.” (not indexed, but still counted in word positions)
Doc #   Text
1       Slow brown truck arrived
2       Shipment of brownies damaged in a fire
3       Delivery of brownies arrived in a slow truck
4       Shipment of brownies arrived in a truck
Inverted index
Term        Postings (doc number:position in doc)
arrived     (1:4), (3:4), (4:4)
brown       (1:2)
brownies    (2:3), (3:3), (4:3)
damaged     (2:4)
delivery    (3:1)
fire        (2:7)
shipment    (2:1), (4:1)
slow        (1:1), (3:7)
truck       (1:3), (3:8), (4:7)
Search for slow AND truck gets as results documents 1 and 3, since both contain slow and truck.
Search for slow (w) truck retrieves only document 3, in which slow is 7th and truck is 8th, so they are right next to each other. Doc 1 has both words, but not next to each other, and thus is not retrieved.
Thesaurus
• Good old Peter Mark Roget had a most
useful idea & did a great job
• Following this idea thesaurus became
THE major tool for controlled
vocabulary in information retrieval
(IR)
– starting in the 1950s & to this day many IR
thesauri have been developed
– all have a similar structure & function
– but they are difficult & costly to
construct
What is a thesaurus?
“For writers, it is a tool like Roget’s
one with words grouped and classified to
help select the best word to convey a
specific nuance of meaning.
For indexers and searchers, it is an
information storage and retrieval tool: a
listing of words and phrases authorized
for use in an indexing system, together
with relationships, variants and
synonyms, and aids to navigation through
the thesaurus.”
(Milstead, 2000)
more…
“A thesaurus to an information
scientist is a controlled set of
the terms used to index information
in a database, and therefore also
to search for information in that
database so the same concepts are
represented by the same term.”
(Batty, 1998)
Basic thesaurus components
• For each entry thesaurus has a
classification grid:
– Descriptor (DE) – an index term that has
• Scope note (SN) – context in which used
• Broader terms (BT) – higher in a hierarchy
• Narrower terms (NT) – lower in a hierarchy
• Related terms (RT) – other connected descriptors
• Used for (UF) – synonyms that are not descriptors
– Note: not all of these may be present for every
descriptor
• A searcher or indexer can use these as a
guide for selection/rejection & for browsing
to get ideas
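One way to picture such an entry is as a small record. The Python sketch below is only an illustration: the class, the function and the sample entry (“Trucks” and its terms) are invented for this example and are not taken from ERIC or any real thesaurus. It also shows the role of UF: a non-descriptor synonym leads the searcher or indexer to the authorized descriptor.

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    descriptor: str                                # DE
    scope_note: str = ""                           # SN
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    related: list = field(default_factory=list)    # RT
    used_for: list = field(default_factory=list)   # UF (synonyms, not descriptors)

def authorized_term(entries, term):
    """Return the descriptor to use for `term`, or None if it is not covered."""
    wanted = term.lower()
    for entry in entries:
        if wanted == entry.descriptor.lower():
            return entry.descriptor
        if wanted in (synonym.lower() for synonym in entry.used_for):
            return entry.descriptor
    return None

# Hypothetical entry; not every field has to be filled for every descriptor.
entries = [
    ThesaurusEntry(
        descriptor="Trucks",
        scope_note="Motor vehicles designed to carry goods",
        broader=["Motor vehicles"],
        narrower=["Delivery trucks"],
        related=["Shipments"],
        used_for=["Lorries"],
    )
]
print(authorized_term(entries, "lorries"))   # Trucks
print(authorized_term(entries, "bicycles"))  # None
```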
Examples of thesauri
• Thesauri have been constructed for
a great many domains, from A to Z
– here are some lists
• international & multilingual thesauri
• online thesauri
• among them ERIC Thesaurus (we use it for example)
– BUT: different thesauri may and do treat
the same descriptor (index term)
differently
• having different, more or fewer narrower,
broader, related terms
• thus it is dangerous to use them interchangeably
Standard structure
With variations on the theme, thesauri have similar
conceptual structure to guide searcher or indexer:
Broader terms - BT
Related terms - RT
Descriptor - DE
Used for - UF
Synonyms
Scope note - SN
Narrower terms - NT
Note: Not every descriptor has to have all of these
Same thesaurus but …
• Examples of ERIC (Educational Resources
Information Center) thesaurus as used
differently in different systems:
1. ERIC’s own system
2. ERIC file on DIALOG (begin 1)
3. ERIC file on OVID (accessible through RUL)
• Notice how each uses thesaurus displays
& search in its own way, but the principles
are still the same
Oh well…
ERIC online thesaurus on ERIC
• Allows for
– searching for words that are included
in descriptors by category or all
categories
– browsing alphabetically
– browsing in one of about 40 categories
• Search for library in all
categories found 76 descriptors
that have “library” included
• Out of these, we selected library
education
ERIC online thesaurus on ERIC
descriptor library education
ERIC thesaurus on DIALOG
• In a convoluted way ERIC thesaurus (and
other ones) can be displayed on DIALOG
(and other vendors, such as OVID)
• How?
– begin in file 1 – ERIC
– then expand a desired term – here we used term
library
– you will see under R that certain terms have
related terms – meaning that these are
thesaurus entries
– then expand on one of those to see related
terms
– then you can browse & choose which ones to use
in search
• And here are Print Screens of the process
going …
Expand library
going …
• RT indicates related terms
• 45237 items have library
• This one has 14 related terms
going …
We now choose descriptor LIBRARY ADMINISTRATION and expand on that one
Neat trick: you can expand on expand & get related terms
going …
• These are now R terms of various types
• 14 related terms for this one are listed
• Can expand on this one to see other RT
• You can also select any of these to search
going …
We have now selected r10 – library expenditures
going …
And this is what we got
Now we can view some items in a chosen format
or we can further modify this search: add, refine, …
gone
• This is one of the items we got
• Descriptors used for this item
ERIC thesaurus on OVID
(accessed through RUL)
For library ask to map as thesaurus term
going …
There are more down there but we choose this one to expand
going …
• Entries for descriptor Electronic Libraries
• Continue to search for AND
going …
Retrieved & ready to display
gone
Choose format you want for this item
Relevance feedback
• Method for using information in
items judged relevant to further
refine or change the search
– e.g. in relevant items we can browse
titles, descriptors, identifiers,
abstracts … to get leads for further
search terms & tactics
• in some advanced systems this may
be done automatically
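As a rough sketch of the automatic variant, a system could count which descriptors occur most often in the items a searcher judged relevant and offer them as leads. The Python below is an assumption-laden illustration, not any real system's method: the function name, the record layout and the three sample records are all hypothetical.

```python
from collections import Counter

def suggest_terms(relevant_items, query_terms, top_n=2):
    """Suggest descriptors that recur in relevant items but are not yet in the query."""
    already_used = {term.lower() for term in query_terms}
    counts = Counter()
    for item in relevant_items:
        for descriptor in item["descriptors"]:
            if descriptor.lower() not in already_used:
                counts[descriptor.lower()] += 1
    # Most frequent first; ties are broken alphabetically to keep output stable.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _count in ranked[:top_n]]

# Hypothetical items judged relevant after a search on "library expenditures".
relevant_items = [
    {"descriptors": ["Library expenditures", "Library administration", "Budgets"]},
    {"descriptors": ["Library administration", "Budgets", "Academic libraries"]},
    {"descriptors": ["Library administration", "Library expenditures"]},
]
print(suggest_terms(relevant_items, ["library expenditures"]))
# ['library administration', 'budgets']
```

The same counting could be applied to words from titles, identifiers or abstracts instead of descriptors.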
Query expansion
• Method for adding, modifying,
changing search terms in query
– to broaden, narrow, focus, change …
terms
• Many sources can be used
– relevance feedback, thesauri,
dictionaries, textbooks, documents,
catalogs, & people: users, colleagues,
your own mind & experience
• Some systems suggest terms for query
expansion
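When a thesaurus is the source, query expansion can be sketched as a lookup of BT, NT or RT terms for each search term. The Python below is a minimal illustration under stated assumptions: the function name is invented and the tiny thesaurus is hypothetical, not taken from ERIC or any real system.

```python
# Hypothetical thesaurus fragment: descriptor -> relation -> terms.
THESAURUS = {
    "trucks": {
        "BT": ["motor vehicles"],
        "NT": ["delivery trucks"],
        "RT": ["shipments"],
    },
}

def expand_query(terms, relation):
    """Add the BT, NT or RT terms of each query term, keeping order and no duplicates."""
    expanded = []
    for term in terms:
        candidates = [term] + THESAURUS.get(term, {}).get(relation, [])
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query(["trucks"], "BT"))  # ['trucks', 'motor vehicles']
print(expand_query(["trucks"], "NT"))  # ['trucks', 'delivery trucks']
```

How the added terms are used is the searcher's choice: ORing in broader or related terms widens the search, while replacing a term with one of its narrower terms focuses it.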
Conclusion
• At the base of all
searching are
– terms
– vocabularies
– languages
– but a variety exists
• In reality in searching
there is no completely
controlled or
uncontrolled vocabulary
– matter of degree
– & most importantly,
matter of mastery
symbolically;
controlled & free vocabulary
thank you!