Indexes and Indexing - PLAI-Bicol Region Librarians Council

INDEXES AND
INDEXING
Ma. Theresa B. Villanueva
Head, Microforms and Digital Resource Center
Rizal Library, Ateneo De Manila University
April 15-16, 2013
James O’Brien Library-Ateneo de Naga University
DEFINITION OF TERMS
Index
- a tool that points a user to the information, or the source of information, that the user needs
- a systematic guide designed to indicate subjects, topics, or features of documents in order to facilitate their retrieval
Indexing
- the process of identifying and assigning index terms to a document, either to describe its physical characteristics, give facts about its creator or distribution, or describe its content
General Purposes of Indexes
• To construct representations of documents in a form that is suitable for users to browse through
• To maximize the searching success of users
• To minimize the time and effort spent in finding information
Uses of Indexes
• facilitate reference to specific material or locate wanted information
• serve as a filter to withhold irrelevant materials
• make the information storage and retrieval system useful to individuals
• disclose related information
• serve as a tool for current awareness services
Types of Indexes
By arrangement: Alphabetical, Classified, Concordance
By physical form: Card index, Printed, Microform, Computerized
By type of material: Audiovisual, Book, Periodical/Newspaper
By Arrangement
a. Alphabetical Index – based on the ordering principle of the letters of the alphabet; used for the arrangement of subheadings and cross-references as well as main headings
b. Classified Index – contents are arranged systematically by classes or subject headings
c. Concordance – an alphabetical index of all principal words appearing in a single text or in the multi-volume works of a single author, with a precise pointer to the point at which each word occurs
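The concordance definition above can be illustrated with a minimal sketch. It assumes that "principal words" means every word not on a small stop list, and that a locator is a (line number, word position) pair; the function name `build_concordance`, the stop list, and the sample text are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stop list; a real concordance would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def build_concordance(lines):
    """Map each principal word to the (line, word position) pairs where it occurs."""
    entries = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        words = re.findall(r"[a-z']+", line.lower())
        for position, word in enumerate(words, start=1):
            if word not in STOP_WORDS:
                entries[word].append((line_no, position))
    # Arrange the entry words alphabetically, as a concordance requires.
    return dict(sorted(entries.items()))

text = [
    "An index is a guide to the contents of a document",
    "The index points to the place where a word occurs",
]
concordance = build_concordance(text)
for word, locators in concordance.items():
    print(word, locators)
```

Each printed line is one concordance entry: the word followed by every place it occurs in the sample text.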
By Physical Form
a) Card index – an index recorded on 3” x 5” cards
b) Printed index – a tool in printed form for researching and retrieving information
c) Microform index – an index to microforms such as microfiche and microfilm
d) Computerized index – an index constructed with the use of computers
By Type of Material Indexed
a. Audiovisual Material Index
- textual labeling (index terms or a description) is needed along with image matching
- a search on words may retrieve a particular image related to the search term, which in turn can be used as input to find other related entries
b. Book Index
- a list of words or groups of words arranged alphabetically at the back of a book, giving the page location of the subject or name associated with each word
c. Periodical Index/Newspaper Index
- open-ended projects usually performed by a group of people
- consistency is a challenge, since each periodical issue may deal with unrelated topics by several authors, written in different styles and aimed at different users
Classified Index
Entry points are arranged in a hierarchy of related topics, starting with generic or broad topics and working down to the specific ones.
Examples:
- Index Medicus – classified index in the field of medicine and related disciplines
- Engineering Index – classified index in the field of engineering and related disciplines
Alphabetical Subject Index
An alphabetical subject index covers a number of different kinds of indexes. The arrangement is in alphabetical order and follows a familiar pattern.
Author Index
Entry points are names of persons, organizations, government agencies, institutions, etc.
Examples:
- Development Bank of the Philippines
- Philippine Chamber of Commerce and Industry
- Romulo, Carlos P.
Periodicals Indexes
Examples:
- Reader’s Guide to Periodical Literature (RGPL)
- Index to Philippine Periodicals (IPP)
INDEXING PRINCIPLES
Exhaustivity
– refers to the extent to which a document is analyzed to identify its subject content
Specificity
– refers to the extent to which a concept or topic in a document is identified by a precise term in the hierarchy of its genus-species relations
Consistency
– refers to the extent to which agreement exists on the terms to be used to index the contents of documents
Principle of Exhaustivity
• Exhaustive indexing
use of many index terms to fully cover the major and minor themes of a document
• Selective indexing
use of a few terms to cover only the main or major theme of a document
• Exhaustive indexing results in high recall but low precision.
Principle of Specificity
Example:
Genus: Citrus Fruits
Species:
ORANGES
LEMONS
LIMES
GRAPEFRUITS
• Specificity results in high precision but low recall.
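The slides use recall and precision without defining them. A short sketch using the standard retrieval definitions shows the trade-off; the document identifiers and both search results below are made-up illustrations, not data from the presentation.

```python
def recall(retrieved, relevant):
    """Share of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Share of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

# Hypothetical collection: four documents are relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# Hypothetical result against an exhaustively indexed collection:
# everything relevant is found, along with much that is not.
broad_search = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

# Hypothetical result against a highly specific index:
# only relevant items are found, but some are missed.
narrow_search = {"d1", "d2"}

print(recall(broad_search, relevant), precision(broad_search, relevant))
print(recall(narrow_search, relevant), precision(narrow_search, relevant))
```

With these invented sets, the broad search gives recall 1.0 and precision 0.5, while the narrow search gives recall 0.5 and precision 1.0.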
Principle of Consistency
There are two types of consistency:
Inter-indexer consistency
refers to the agreement between or among indexers in assigning subject terms to a particular article
Intra-indexer consistency
refers to the extent to which one indexer is consistent with himself or herself in assigning subject terms
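The slides do not name a measure of consistency. One simple possibility, assumed here purely for illustration, is the ratio of terms both indexers assigned to all terms either of them assigned. The function name and the sample term sets are hypothetical.

```python
def indexer_consistency(terms_a, terms_b):
    """Terms in common divided by all distinct terms used by either indexer."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty term sets agree trivially
    return len(a & b) / len(a | b)

# Hypothetical term sets assigned to the same article by two indexers.
indexer_1 = {"HYBRID RICE", "AGRICULTURE", "FOOD SUPPLY"}
indexer_2 = {"HYBRID RICE", "FOOD SUPPLY", "PHILIPPINES"}

print(indexer_consistency(indexer_1, indexer_2))
```

The same function measures intra-indexer consistency when both sets come from one indexer working on the same article at different times.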
Indexing Methods
1. Derived or derivative indexing
– a method by which words and phrases occurring in the title or text of a documentary unit are extracted by a human or computer to serve as indexing terms
- also called extractive indexing
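A minimal sketch of derived indexing by computer, assuming the terms are taken from the title only and that common function words are dropped. The stop list and the function name `derive_index_terms` are hypothetical; the title is the sample article used later in these slides.

```python
import re

# Hypothetical stop list of function words that are not useful as index terms.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "towards"}

def derive_index_terms(title):
    """Extract candidate index terms directly from the words of a title."""
    terms = []
    for word in re.findall(r"[a-z]+", title.lower()):
        if word not in STOP_WORDS and word not in terms:
            terms.append(word)
    return terms

title = "Hybrid rice: a new hope towards a bountiful Philippines"
print(derive_index_terms(title))
```

Every term in the output occurs in the title itself, which is what distinguishes derived indexing from assigned indexing.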
2. Assigned indexing
- a method by which terms, descriptors, or subject headings are selected by a human or computer to represent the topics or features of a documentary unit
- assigned terms are often taken from a source other than the document itself
Indexing Language
An indexing language is a language that is
used by the indexer to represent the subject
content of a document.
Purposes and Uses of Indexing Language:
• to represent the subject content of a document, either using the words of the author or assigning appropriate descriptors from a controlled vocabulary
• to help users discriminate between terms and reduce ambiguity in the language
Types of Indexing Language
1. Natural Language
- uses terms or words occurring in the printed text as index entries; it is sometimes called a derived-term system
Characteristics of using Natural Language:
• Improves recall because it provides more access points, but reduces precision
• Redundancy is greater
• Uses more current terms
• Tends to be favored by end-users
2. Controlled vocabulary
- represents the general conceptual structure of one or more subject areas and presents a guide to the users of the index
- categorized as an assigned-term system
A controlled vocabulary provides cross-references to show the three relationships of terms:
a) equivalence
b) hierarchical
c) associative
This is achieved by showing under each term:
broader term (BT)
narrower term (NT)
related term (RT)
use for (UF)
see also (SA)
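One way to picture these references is as fields of a thesaurus record. The sketch below is only an illustration: the class `ThesaurusEntry`, its field names, and the helper `preferred_term` are hypothetical, while the EMPLOYEES data comes from the examples in these slides.

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    term: str                               # the preferred descriptor
    uf: list = field(default_factory=list)  # use for: non-preferred equivalents
    bt: list = field(default_factory=list)  # broader terms
    nt: list = field(default_factory=list)  # narrower terms
    rt: list = field(default_factory=list)  # related terms

thesaurus = {
    "EMPLOYEES": ThesaurusEntry(
        term="EMPLOYEES",
        uf=["Personnel", "Staff", "Workers"],
        bt=["PEOPLE"],
        nt=["HOTEL EMPLOYEES", "RAILROAD EMPLOYEES"],
        rt=["EMPLOYMENT"],
    ),
}

def preferred_term(word, thesaurus):
    """Follow the equivalence relationship from any variant to its descriptor."""
    needle = word.lower()
    for entry in thesaurus.values():
        variants = [entry.term.lower()] + [u.lower() for u in entry.uf]
        if needle in variants:
            return entry.term
    return None  # the word is not in the vocabulary

print(preferred_term("staff", thesaurus))
```

The UF list carries the equivalence relationship, BT and NT the hierarchical one, and RT the associative one.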
Relationships of Terms:
a. Equivalence relationship - implies that
there will be more than one term denoting
the same concept
Equivalence relationship:
Example 1
Use for (UF) or Use reference (see reference)
Example: EMPLOYEES
UF: Personnel
Staff
Workers
- refers the user from a non-usable term to the preferred descriptor
Equivalence relationship
Example 2:
BIRTH CONTROL
UF : Family Planning
- reference deals primarily with synonymous or
variant forms of the preferred descriptor
- it is also used to lead the indexer to more
general terms
Examples that indicate Equivalence relationship:
• Synonyms (e.g. Reason; Cause)
• Quasi-synonyms (e.g. Law; Law Management)
• Preferred spelling (e.g. Catalog; Catalogue)
• Acronyms and abbreviations (e.g. ASEAN; Association of Southeast Asian Nations)
• Current and established terms (e.g. Cellular Radio; Cellular Phone)
• Translation (e.g. Coconut Coir; Bunot)
b. Hierarchical relationship
– refers to the general and specific or broad and narrow
type of relationship
Hierarchical relationship
Example 1 :
Broader term (BT)
Employees
BT : People
- shows the hierarchical relationship upward in the classification ranking
- it differs from the use for reference in that both the basic term and its broader term are descriptors, and both can be used
Hierarchical relationship:
Example 2
Cats
BT: ANIMALS
"ANIMALS" is a broader term to "CATS" because all cats are animals.
Reference: http://publish.uwo.ca/~craven/677/thesaur/main05.htm
Hierarchical relationship:
Example 3
Narrower term (NT)
Employees
NT : HOTEL EMPLOYEES
RAILROAD EMPLOYEES
- reference is similar to the broader term reference,
except it goes down in the classification ranking
Hierarchical relationship:
Example 4
Head
NT : NOSE
“NOSE” might be a narrower term to
“HEAD”, because noses are normally
parts of heads.
Reference:
http://publish.uwo.ca/~craven/677/thesaur/main05.htm
• Genus-species relationship (represents class inclusion)
Example: Animals
Domestic Animals
Cats
• Whole-part relationship
Example: Hand
Fingers
• Instance relationship
Example: Mountains
Mount Apo
c. Associative relationship
- refers to a non-hierarchical relationship
of terms
Associative relationship
Example 1 :
Related term (RT)
EMPLOYEE
RT : EMPLOYMENT
- refers to a descriptor that can be used in addition to the basic term but is not in a hierarchical relationship with it
Associative relationship
Other Examples :
• Teachers – Students
• Tables – Chairs
• Education – Teaching
• Men – Women
Scope Note (SN) & Qualifier - used to inform users about restrictions on a descriptor’s usage or to clarify ambiguity; a scope note may also give additional instructions to indexers
Scope Note:
Examples: INDEXING (SN) Assigning of natural language terms to documents
HOSPITALIZATION (SN) Assign also terms for the conditions for which patients were hospitalized, if applicable
Qualifier:
Example: Security (Law)
Security (Psychology)
Reference: http://publish.uwo.ca/~craven/677/thesaur/main08.htm
Functions of Controlled Vocabulary:
• To control synonyms by choosing one form as the standard term
• To make distinctions among homographs
• To link or bring together terms whose meanings are closely related
Example: Cereals and Wheat
• To control variant spellings
A controlled vocabulary may take the form of verbal expressions, as illustrated by subject headings lists and thesauri, or of coded/nonverbal expressions, as shown by classification schemes.
Subject headings lists – lists of terms representing several subject fields; some focus on specific fields
Thesauri – authority devices that cover more specific or narrower subject fields
Classification schemes – generally contain coded expressions or notations for the relevant topics in a particular class or subclass
INDEXING
GUIDELINES & PROCEDURES
Part 2
INDEXING PROCESS:
1. Recording of bibliographic data
- recording the important information or the elements that identify a particular document
The International Organization for Standardization (ISO) has set a standard for bibliographic references:
ISO 690:1975 (E) – “Bibliographic References: Essential and Supplementary Elements”
- When indexing the contents of a collection of documents, locators should give complete information about each document.
- For an article or contribution in a periodical, each entry normally consists of the following essential elements:
Name(s) of Author(s) with forenames
Title of the article
Title of the periodical or Source
Volume Number
Issue Number
Date of the issue
Page number
Example:
Name(s) of Author(s): [Xian, Jie]
Title of the article : [Hybrid rice: a new hope towards a
bountiful Philippines]
Title of the periodical or Source : [Impact]
Volume Number : [46]
Issue Number : [9]
Date of the issue : [September 2007]
Page number : [4-8]
ISO FORMAT:
Sample entry:
________________
(subject/Topic)
Xian, Jie. Hybrid rice: a new
hope towards a bountiful
Philippines. Impact, Vol. 46,
no.9, S ‘12, p. 4-8.
Format comparison:
ISO FORMAT:
_______________
(subject/topic)
Xian, Jie. Hybrid rice: a new
hope towards a bountiful
Philippines. Impact, Vol. 46,
no.9, S ‘12, p. 4-8.
ATENEO FORMAT:
________________
(subject/Topic)
Hybrid rice: a new hope towards a
bountiful Philippines. Xian, Jie.
Impact 46 (9) : 4-8. S ‘12.
OTHER FORMAT:
_______________
(subject/topic)
Xian, Jie. Hybrid rice: a new hope
towards a bountiful Philippines.
Impact 46 (9) : 4-8. S ‘12
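The formats compared above arrange the same recorded elements in different orders. A small sketch makes this concrete for the ISO and Ateneo layouts. The class `ArticleRecord` and the two formatting functions are hypothetical names, and the abbreviated date is passed in as already written on the slide rather than computed.

```python
from dataclasses import dataclass

@dataclass
class ArticleRecord:
    author: str
    title: str
    periodical: str
    volume: int
    issue: int
    date: str   # abbreviated date exactly as it should appear in the entry
    pages: str

def iso_entry(r):
    """Author first, then title, then source with labeled volume and issue."""
    return (f"{r.author}. {r.title}. {r.periodical}, "
            f"Vol. {r.volume}, no.{r.issue}, {r.date}, p. {r.pages}.")

def ateneo_entry(r):
    """Title first, then author, then a compact volume (issue) : pages source."""
    return (f"{r.title}. {r.author}. {r.periodical} "
            f"{r.volume} ({r.issue}) : {r.pages}. {r.date}.")

record = ArticleRecord(
    author="Xian, Jie",
    title="Hybrid rice: a new hope towards a bountiful Philippines",
    periodical="Impact",
    volume=46,
    issue=9,
    date="S '12",
    pages="4-8",
)
print(iso_entry(record))
print(ateneo_entry(record))
```

Recording the elements once and formatting them on output lets one record serve every entry style.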
2. Subject determination
- determining the “aboutness” of the material and formulating a concept list
• Choose the most appropriate concepts; consider the users and the purpose of the index.
• No arbitrary limit should be set on the number of terms or descriptors that can be assigned to a document.
- the number should be determined fully by the amount of information contained in the document
- it should be related to the expected needs of the users of the index
• Modify the indexing guidelines and procedures if needed, but modification should not compromise the structure or logic of the indexing language.
• Concepts should be as specific as possible. More general concepts may be preferred in some circumstances, depending upon the following factors:
– over-specificity might adversely affect the performance of the indexing system
– if an idea is not fully developed, or is referred to only casually by the author, then it might be justified to index at a more general level
3. Content/Conceptual analysis
– identifying the topics discussed in a document and determining which aspects of it users will be interested in
Content Analysis
- Decide which topics in the item are relevant to the potential users of the document.
- Decide which topics truly capture the content of the document.
- Determine terms that come as close as possible to the terminology used in the document.
- Decide on index terms and the specificity of those terms.
Parts of the document that have to be analyzed
• Title of the document/article
- it is considered the basic indexing unit
- it is the first stop in determining the subject content
• Abstract
- an information-packed miniature of the document
- a good abstract can be a fundamental indicator of subject content
• Text itself
- includes the introduction, summary, conclusion, section headings, and the first and last sentences of paragraphs
• Illustrations, diagrams, tables and captions
• References
- reference sources cited by the author may also be considered subject indicators
Factors that may affect content analysis:
• a labor shortage or other critical time factor
• the guidelines and policies imposed by institutions, which generally concern the selection of index content
• the indexer’s decisions on which aspects of the subject will be emphasized and which will be de-emphasized
4. Translation
- involves the conversion of terms in the natural
language into standard terms drawn from a
controlled vocabulary such as thesaurus,
subject headings list, etc.
- match terms in the concept list against those
available in the controlled vocabulary
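The matching step described above can be sketched as a lookup against a controlled vocabulary. The toy vocabulary below reuses the EMPLOYEES and BIRTH CONTROL examples from the thesaurus slides. The function name `translate`, the data layout, and the "telecommuting" concept are hypothetical.

```python
# Toy controlled vocabulary: preferred descriptor -> non-preferred variants.
CONTROLLED_VOCABULARY = {
    "EMPLOYEES": ["personnel", "staff", "workers"],
    "BIRTH CONTROL": ["family planning"],
}

def translate(concepts, vocabulary):
    """Convert natural-language concepts into preferred descriptors.

    Returns the matched descriptors and the concepts that found no match,
    which would need checking against reference tools or subject specialists.
    """
    lookup = {}
    for descriptor, variants in vocabulary.items():
        lookup[descriptor.lower()] = descriptor
        for variant in variants:
            lookup[variant.lower()] = descriptor

    matched, unmatched = [], []
    for concept in concepts:
        descriptor = lookup.get(concept.strip().lower())
        if descriptor is None:
            unmatched.append(concept)
        elif descriptor not in matched:
            matched.append(descriptor)
    return matched, unmatched

concept_list = ["Staff", "family planning", "Workers", "telecommuting"]
matched, unmatched = translate(concept_list, CONTROLLED_VOCABULARY)
print(matched)
print(unmatched)
```

Concepts left in the unmatched list are the ones that need a new term or a temporary, more general term.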
Practices to follow in the Translation process:
- Concepts that are already represented by indexing terms should be translated into their preferred terms.
- Terms that represent new concepts should be checked for accuracy and acceptability against reference tools such as:
◦ Dictionaries and encyclopedias
◦ Thesauri (UNBIS Thesaurus)
◦ Classification schemes (Library of Congress)
◦ Established indexes (Reader’s Guide to Periodical Literature)
- Subject specialists, particularly those with some knowledge of indexing or documentation, may also be consulted.
- If the concepts are not found in an existing thesaurus or classification scheme, these may be:
• expressed by terms or descriptors that are admitted into the indexing language
• represented temporarily by more general terms, with the new concepts proposed as candidates for later addition
Translation
- Group references to information that is scattered in the text of the document.
- Combine headings and subheadings into related multilevel headings.
- Direct the user seeking information under terms not used to the terms that are used by means of see references, and to related terms by means of see also references.
- Arrange the index into a systematic presentation.
Generating Index Entries
Index entries may be generated manually or by computer.
Manual generation – involves generating index entries one by one using an ordinary or electric typewriter
Machine generation – involves the use of computers in generating index entries; various software packages are available
Indexing Techniques for Periodicals
1. Topics that can be considered for indexing are the following:
- persons
- sports events
- economic news
- special features
- social trends
- local politics
- entertainment
- editorials & columns
- first and last events
• All articles that have permanent value should be indexed under all topics and issues dealt with.
• Editorials should be indexed under their topics like any other article but differentiated from the others by adding (Ed.) or (E). The titles of editorials may be indexed under a collective heading “Editorials”.
• Letters to the editor, if considered indexable, should be indexed by topic, not under a caption that may have been assigned by the editor. It is advisable to index at least the name of the person who criticized an article as well as the author’s response.
2. Preference and Forms of Headings based on the International Organization for Standardization (ISO 999)
Personal Names:
– Provide as full a form as possible
– Choose the most recent or most commonly used form of a personal name as the heading and add a “see” cross-reference from other forms
– Personal names should take the form used in the document, but if the text is not consistent the indexer should adopt one form
– Compound and multiple surnames, whether hyphenated or not, should be indexed under the first part
e.g. Lee Chua, Queena, Loren ; Perez de Cuellar, Javier
– Persons normally identified by a title of honor or nobility should be indexed under the first name
e.g. Prince Charles see Charles, Prince of Wales
Queen Elizabeth I see Elizabeth I, Queen of England
Corporate Bodies
• Names of corporate bodies should normally be indexed without transposition and in as full a form as necessary. An initial article is omitted unless specifically required for semantic or grammatical reasons
e.g. Lopez Museum
• Transposition may be used if it is considered that this would help the users of the index
e.g. Department of Energy
see Energy, Department of
• Choose the most recent, or the most commonly used, form of corporate name as the main heading and add “see” cross-references from other forms
e.g. Philippine Normal College
see Philippine Normal University
Geographic Names
• Geographic names should be as full as is necessary for clarity, with additions to avoid confusion with otherwise identical names
Example: J.P. Rizal (Quezon City)
J.P. Rizal (Marikina)
• An article or preposition should be retained in a geographic name of which it forms an integral part
Example: Santolan, Pasig City
• Where the article or preposition does not form an integral part of a name it should be omitted
Example: New Day rather than The New Day
INDEXING STANDARDS
Part 3
Standards serve as models and guidelines for the analysis of documents, construction and organization of indexes, indexing terminology, construction and use of thesauri, etc. They promote consistency and uniformity.
A. International Organization for Standardization (ISO)
- a network of the national standards institutes of 146 countries, on the basis of one member per country, with a Central Secretariat in Geneva, Switzerland, that coordinates the system.
ISO 5963: 1985 Documentation
– Methods for examining documents, determining their
subjects, and selecting indexing terms
ISO 999: 1996 Information and documentation
– Guidelines for the content, organization and presentation
of indexes
ISO 4: 1997 Information and documentation
– Rules for the abbreviation of title words and titles of
publications. It publishes a List of Serial Title Word
Abbreviations which includes title word abbreviations in
over 50 languages.
B. National Information Standards Organization (NISO)
• A nonprofit association accredited by the American National Standards Institute (ANSI) that identifies, develops, maintains, and publishes technical standards to manage information in our changing and ever-more digital environment.
• NISO standards apply both traditional and new technologies to the full range of information-related needs, including retrieval, repurposing, storage, metadata, and presentation.
Standards developed by NISO:
– ANSI/NISO Z39.2 – 1994 (R2001) Information interchange format
*Equivalent international standard: ISO 2709
– ANSI/NISO Z39.19 – 2003 Guidelines for the construction, format, and management of monolingual thesauri
*Equivalent international standard: ISO 2788
C. British Standards Institution (BSI)
– as the National Standards Body of the UK, it develops standards and applies innovative standardization solutions to meet the needs of business and society.
Standards developed by BSI (related to library and information science):
– BS 1749: 1985 Recommendations for alphabetical arrangement and the filing order of numbers and symbols
• Provides guidance on arranging entries within lists of all kinds, e.g. bibliographies, catalogues, directories and indexes.
– BS ISO 999: 1996 Information and Documentation – guidelines for the content, organization and presentation of indexes
Automatic Indexing
- refers to indexing by machine, or the analysis of text by means of computer algorithms.
- The focus is on automatic methods used behind the scenes with little or no input from individual searchers, with the exception of relevance feedback.
- It does not include searching options and techniques used by human searchers, such as methods for creating effective search statements, adding weights to terms, specifying proximity requirements, using truncation or wild cards, or combining terms with Boolean or role operators.
Four Types of Approaches
• Statistical – based on counts of words, statistical associations, and collation techniques that assign weights and cluster similar words
Example:
Tf-idf (term frequency-inverse document frequency), which is frequently used in many search engines.
The intuitive philosophy behind tf-idf is that terms that are frequent in many documents are less suited to make discriminations, while terms that are frequent within a single document may indicate that this document has much information about the things the terms refer to.
Source: Cleveland & Cleveland, 2001, p. 211
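The tf-idf intuition can be made concrete with a short calculation. The sketch below uses one common variant (term frequency as a share of the document's words, multiplied by the natural log of total documents over documents containing the term). Other weightings exist and the slides do not specify one. The three sample documents are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "rice rice hybrid farming",
    "rice prices economy",
    "rice export economy",
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items()):
    print(term, round(weight, 3))
```

Here "rice" occurs in every document, so its weight is zero and it cannot discriminate among them, while "hybrid" occurs in only one document and gets a positive weight there.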
• Syntactical
– stresses grammar and parts of speech, identifying concepts found in designated grammatical combinations, such as noun phrases
• Semantic systems
– are concerned with the context sensitivity of words in the text
Examples: What does cat mean in terms of its context? House cats? Heavy earthmoving equipment?
• Knowledge-based
– systems go beyond thesaurus or equivalence relationships to knowing the relationships between words
Example: ‘tibia’ is part of a leg, thus the document is indexed under ‘leg injuries’.
Human / Manual Indexing vs. Automatic Indexing
• Automatic methods have trouble handling synonyms, homonyms, and semantic relations, and they conceptualize very poorly. Human indexers go through cognitive processes that may be influenced by their background experience, education, training, intelligence, and common sense.
• Computers can, and humans cannot, organize all words in a text and in a given database and perform statistical operations on them (e.g. tf-idf).
Websites for Indexers
Indexing Services
H.W. Wilson Home Page (http://www.hwwilson.com/)
Wright Information (http://mindspring.com/~jancw/)
Susan Holbert Indexing Services ( http://abbington.com/holbert/)
Special Formats and Subjects Indexing
ASIS Thesaurus of Information Science
(http://www.asis.org/Publications/Thesaurus/isframe.htm)
The Library of Congress Thesauri (http://lcweb.loc.gov/pmei/lexico/liv/bsearch.html)
Standards
National Information Standards Organization (http://www.niso.org/)
ANSI/NISO Z39.14-1997 Guidelines for Abstracts (http://www.ansi.org/)
ANSI Z39.4-1984 Basic Criteria for Indexes (http://www.ansi.org/)
Indexing software
HTML Indexer (for Windows) http://www.html-indexer.com/
Cindex (for DOS, Windows, and Macintosh) http://www.indexres.com