Module 5b: Intro to Controlled Vocabularies, Taxonomies and

Recap
• We looked at the indexing process to
see how controlled vocabularies can be
used to enhance access to information
– Different methods of indexing provide
different results
– Need to decide on your approach based on
an analysis of your business objectives, the
user needs, and the domain
– A combination of automatic and human
indexing is often the best solution
IMT530- Organization of Information Resources
Module 6a: Intro to Controlled Vocabularies,
Taxonomies and Classification
IMT530: Organization of Information Resources
Winter 2008
Michael Crandall
Module 6a Outline
• Where we are
• Controlled vocabularies
• Types of controlled vocabularies
• Tagging
• Overview of building vocabularies
Overview of Subject
Representation
• Subject analysis
– a technique used to determine the “subject(s)” and
disciplinary context exemplified by an object
• Subject indexing
– a technique through which subject terms (words,
taxonomic categories, or notation) are added to an
object representation to describe the subject
content of the object
• Controlled vocabularies
– standards containing controlled subject terms
(words, taxonomic categories, or notation) used in
the indexing process
Controlled Vocabulary: Definition
• A controlled vocabulary is a list of terms
(words or phrases) or codes (notation)
used for indexing
• Almost always, controlled vocabularies
show relationships among terms
Purpose of Controlled
Vocabularies
• Specific Purposes
– To provide access to content by subject, through
providing hierarchical and associative relationships
and synonym control for the terms used in the
domain
– Increase precision in retrieval and display by
controlling homographs (words that are spelled the
same but have different meanings)
• General Purposes
– Assist users by conveying meaning, orientation,
and structure in a subject area
– Assist users by providing rich relationships among
concepts and terms
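A minimal sketch of synonym control and homograph control, written in Python. The terms and mappings below are hypothetical illustrations, not entries from any real vocabulary: entry words are mapped to one preferred term, and a homograph is split into qualified terms so that each meaning gets its own heading.

```python
# Hypothetical mini-vocabulary: every term and mapping here is illustrative.

# Synonym control: several entry words lead to one preferred term.
PREFERRED = {
    "cars": "Automobiles",
    "autos": "Automobiles",
    "automobiles": "Automobiles",
}

# Homograph control: one spelling, several meanings, kept apart
# with parenthetical qualifiers.
HOMOGRAPHS = {
    "mercury": ["Mercury (planet)", "Mercury (element)"],
}


def lookup(entry: str) -> list[str]:
    """Return the controlled term(s) that an entry word leads to."""
    key = entry.strip().lower()
    if key in HOMOGRAPHS:
        # The indexer or searcher must choose among the qualified terms.
        return list(HOMOGRAPHS[key])
    if key in PREFERRED:
        return [PREFERRED[key]]
    return []  # not in the vocabulary


print(lookup("Autos"))    # ['Automobiles']
print(lookup("mercury"))  # ['Mercury (planet)', 'Mercury (element)']
```

Collapsing synonyms onto one term gathers related material under a single heading; forcing a choice between qualified homographs is what improves precision.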
Buckland
• Proposes five different vocabularies in
any system:
– Authors
– Indexers
– Syndetic structure
– Searchers
– Formulated queries
• Formal tradition vs. document tradition
Types of Controlled Vocabularies
Zeng, M.L. (2005). Construction of controlled vocabularies: A primer.
• Subject Heading List
• Taxonomy
• Thesauri
• Classification Scheme
• More terminology on Leonard Will’s site
– http://www.willpowerinfo.co.uk/glossary.htm
Subject Heading Lists
• General list of terms (words and phrases), not limited
by discipline or subject area
• Terms are called subject headings
• The distinction between thesauri & subject heading
lists is largely historical (subject heading lists are
older); there are very few subject heading lists
because they are so expensive to maintain
• Terms are mainly subject attributes, but there are
many exemplified attributes used in subdivisions
• Example: Library of Congress Subject Headings
(LCSH), used in library catalogs
– Sample terms: “France – Colonies – History – 18th century”;
“Time and space – Juvenile fiction”; “Frogs” (notice the use
of subdivisions, marked here by dashes; thesauri seldom use
subdivisions)
Taxonomies
• List of terms (words and phrases) that may be general
or subject/discipline/domain specific
• Terms are called taxons or (simply) terms
• Terms represent subjects, disciplines/domains, and
exemplified attributes
• Used in digital environments only
• Examples: Microsoft Corporation intranet
taxonomies; Yahoo taxonomy used in the Yahoo
directory
– Sample terms from the Yahoo taxonomy (in Yahoo, you’ll find
these at the top of the screen as you browse through the
directory): “Education”; “Science > Agriculture > Research >
Government Agencies”; “Health > Nursing”; “Health >
Education”;
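The ">"-separated sample terms above can be read as paths through a tree. As a rough sketch (the function name `build_tree` and the nested-dictionary representation are assumptions made for illustration, not how Yahoo stored its directory), the paths can be folded into a hierarchy like this:

```python
def build_tree(paths):
    """Fold '>'-separated category paths into a nested dictionary."""
    tree = {}
    for path in paths:
        node = tree
        for label in (part.strip() for part in path.split(">")):
            # Reuse the child if it already exists, otherwise create it.
            node = node.setdefault(label, {})
    return tree


# Sample terms quoted on the slide.
PATHS = [
    "Education",
    "Science > Agriculture > Research > Government Agencies",
    "Health > Nursing",
    "Health > Education",
]

TREE = build_tree(PATHS)
print(sorted(TREE))            # ['Education', 'Health', 'Science']
print(sorted(TREE["Health"]))  # ['Education', 'Nursing']
```

Note that "Education" appears twice, once at the top level and once under "Health"; in a taxonomy the position in the path is part of what the term means.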
Thesauri
• Thesauri (pl.) / Thesaurus (s.)
– List of terms (words and phrases) that are usually
limited to a specific subject or disciplinary area
– Terms listed in a thesaurus are often called
descriptors
– Thesauri were mostly defined and developed after
the advent of the computer and were created for
use in a computerized environment (or with
computers in mind)
– Terms are usually subject (about) attributes, but
some thesauri also contain exemplified (example
of) attributes (see http://www.e-government.govt.nz/nzgls/thesauri)
– Example: ERIC Thesaurus (education)
• Sample terms from the ERIC Thesaurus: “School
community relationship”; “College entrance exams”; “Age
grade placement”
“Classification” Schemes
• Chart of subject categories contextualized by
a hierarchical structure
• Terms are lists of codes (notation)
• Terms are called classes and class numbers
• Classification schemes make use of
disciplinary, subject, and (sometimes)
exemplified attributes
• Used often to arrange physical documents;
sometimes used in online environments
“Classification” Example
• Examples: Dewey Decimal Classification
(DDC); Universal Decimal Classification
(UDC); Colon Classification
• Sample entries (DDC):
– 510 (meaning: “Mathematics” (a discipline and a
subject));
– 512.57 (meaning: “Mathematics / Linear,
multilinear, multidimensional algebras / Factor
algebras”)
– 362.582 (meaning: “Social problems and services
/ Problems of and services to the poor / Financial
assistance”)
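In the DDC samples above, the notation itself carries the hierarchy: 512.57 sits under 510 (Mathematics). A small sketch of that idea follows. It assumes that dropping trailing digits always yields a broader class, which holds for these samples but is a simplification; the function name `broader_classes` is hypothetical.

```python
def broader_classes(number: str) -> list[str]:
    """List progressively broader class numbers, most specific first.

    Assumption: each shorter prefix of the notation names a broader
    class. Real schemes contain exceptions to this rule.
    """
    digits = number.replace(".", "")
    result = []
    for length in range(len(digits), 0, -1):
        head = digits[:length]
        if length <= 3:
            # Pad to the three-digit minimum: '51' -> '510', '5' -> '500'.
            code = head.ljust(3, "0")
        else:
            # Re-insert the decimal point after the third digit.
            code = head[:3] + "." + head[3:]
        if code not in result:
            result.append(code)
    return result


print(broader_classes("512.57"))
# ['512.57', '512.5', '512', '510', '500']
```

A searcher who finds too little at 512.57 can broaden the search by walking up this list, which is one practical payoff of hierarchical notation.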
Four Types of Classification
• Kwasnik describes four classification systems
– Hierarchies
– Trees
– Paradigms
– Facets
• Paradigms are useful primarily for analysis of subject
gaps and relationships in a constrained space
• Trees are a poor form of hierarchy with limited
relationships
• We’ll look at the other two in some detail over the next
two weeks
Hierarchies
• Good for representation of knowledge in
mature domains where the nature of the
entities and relationships is well known
• You’ll see examples of these in the thesauri
that we will look at in today’s exercise
• Require a model that describes what entities
are included, with rules of association and
distinction
• Tend to be monolithic and cumbersome for
large domains
Facets
• Actually a different approach rather than
a different structure
– May use hierarchies or trees as part of the
structure
– Originated in the work of S.R. Ranganathan
• Proposed that any object could be viewed in
five ways: personality, matter, energy, space
and time (PMEST)
– Being used more and more in modern
information systems because of flexibility in
meeting multiple needs
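One way to picture the faceted approach is as a record with one slot per PMEST facet, where any combination of facets can be used to filter. The sketch below is illustrative only: the class name, the helper `filter_by_facets`, and the three sample records are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FacetedRecord:
    """One object described along Ranganathan's five facets (PMEST)."""
    title: str
    personality: str  # the core thing the object is about
    matter: str       # material or property
    energy: str       # action or process
    space: str        # place
    time: str         # period


def filter_by_facets(records, **wanted):
    """Keep records whose facets match every requested value."""
    return [
        record for record in records
        if all(getattr(record, facet) == value for facet, value in wanted.items())
    ]


# Hypothetical sample records.
RECORDS = [
    FacetedRecord("Oak furniture making in England", "furniture",
                  "oak", "manufacture", "England", "19th century"),
    FacetedRecord("Oak furniture restoration in France", "furniture",
                  "oak", "restoration", "France", "21st century"),
    FacetedRecord("Steel bridge construction in England", "bridges",
                  "steel", "construction", "England", "19th century"),
]

hits = filter_by_facets(RECORDS, space="England", time="19th century")
print([record.title for record in hits])
# ['Oak furniture making in England', 'Steel bridge construction in England']
```

Because each facet is independent, the same records serve a user browsing by place, by material, or by period, without committing to a single hierarchy in advance.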
Collaborative Tagging
• Golder and Huberman point out issues of “basic level”
and “collective sensemaking”
• Tug of war between personal storage
– Identifying qualities
– Self reference
– Task organizing
• and public nature of access
– What or who it is about
– What it is
– Who owns it
– Categories
• Stability emerges from imitation and shared
experience
Trees vs. Tags
• Weinberger’s article postulates three types of
vocabularies
– Trees (hierarchies)
– Facets
– Tags
• Golder/Huberman and Weinberger both point
out that each approach can be useful in
particular situations
– Choosing your approach is part of the process of
subject and domain analysis
Steps in Constructing CVs
• Define your domain
• Gather concepts
– From user interviews, search logs, content
analysis, preexisting vocabularies
• Select your approach
• Extract terminology
• Control your terms
• Organize your terms
• Maintain, maintain, maintain
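Two of the steps above, gathering concepts from search logs and controlling terms, can be sketched in a few lines. The search log and the variant mapping below are hypothetical; in practice the mapping records editorial decisions made by the vocabulary builder, not something a script can infer on its own.

```python
from collections import Counter

# Hypothetical raw queries taken from a search log.
SEARCH_LOG = [
    "Laptops", "laptop", "notebook computers",
    "laptop ", "Notebook Computers", "printers",
]

# Hypothetical editorial decisions: variant -> preferred term.
VARIANTS = {
    "laptops": "laptop",
    "notebook computers": "laptop",
}


def gather(queries):
    """Gather candidate terms: normalize case and spacing, then count."""
    return Counter(query.strip().lower() for query in queries)


def control(counts, variants):
    """Control terms: merge each variant's count into its preferred term."""
    merged = Counter()
    for term, count in counts.items():
        merged[variants.get(term, term)] += count
    return merged


candidates = gather(SEARCH_LOG)
print(control(candidates, VARIANTS).most_common())
# [('laptop', 5), ('printers', 1)]
```

Counting before and after control shows how much user demand was hidden behind spelling and wording variants, which helps decide which terms deserve a place in the vocabulary.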
Questions?
• If not, take a break!!!
Exercise 6a
• Purpose is to explore some existing controlled
vocabularies to investigate their differences
and similarities, how useful they might be for
subject access, and to become familiar with
the structure of controlled vocabularies in
general
• Spend the next 45 minutes on Exercise 6a
• Ask questions and talk!!!
• Be sure to hand in completed work at the end
of class for credit!!!
Next Week
• We’ll start to look at ways to build
controlled vocabularies and the rules
associated with them
• Remember to read assignments
BEFORE class