Lecture 06: Controlled Vocabularies
Introduction
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2002
IS 202 - FALL 2002
Some slides in this lecture were developed by Prof. Marti Hearst
2002.09.12 - SLIDE 1
• Review
– Dublin Core
– Other Metadata Systems
• Controlled Vocabularies
• Name Authority Files
– Choice of Names
– Form of Names
• Other Types of Controlled Vocabularies
• Faceted vs. Hierarchic Organization of Vocabularies
IS 202 - FALL 2002 2002.09.12 - SLIDE 2
Metadata Systems and Standards
• Naming and ID systems – URLs, ISBNs
• Bibliographic description – MARC, Dublin Core, TEI, etc.
• Music – SMDL
• Images and objects – CIMI, VRA core categories
• Numeric data – DDI, SDSM
• Geospatial data – FGDC
• Collections – EAD
IS 202 - FALL 2002 2002.09.12 - SLIDE 4
• Simple metadata for describing internet resources
• For “Document-Like Objects”
• 15 Elements (in base DC)
IS 202 - FALL 2002 2002.09.12 - SLIDE 5
• Title
• Creator
• Subject
• Description
• Publisher
• Other Contributors
• Date
• Resource Type
• Format
• Resource Identifier
• Source
• Language
• Relation
• Coverage
• Rights Management
IS 202 - FALL 2002 2002.09.12 - SLIDE 6
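As a rough illustration of the fifteen elements listed above, a record can be sketched as a simple mapping from element name to value. The sample values and the `missing_elements` helper are hypothetical, and the element names follow the labels used on these slides rather than any official encoding.

```python
# The fifteen base Dublin Core elements, using the labels from the slides.
DC_ELEMENTS = [
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributors", "Date", "Resource Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
]

# A hypothetical, partially filled record describing this lecture.
record = {
    "Title": "Lecture 06: Controlled Vocabularies",
    "Creator": ["Larson, Ray", "Davis, Marc"],
    "Date": "20020912",      # YYYYMMDD
    "Format": "text/html",   # assumed for illustration
    "Language": "eng",       # three-character language code
}

def missing_elements(rec):
    """Return the base elements that the record leaves empty."""
    return [name for name in DC_ELEMENTS if not rec.get(name)]

print(len(DC_ELEMENTS))
print(missing_elements(record))
```

Every element is optional in this sketch, which is one reason two people describing the same resource can produce quite different records.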
• Label: TITLE
• The name given to the resource by the
CREATOR or PUBLISHER
IS 202 - FALL 2002 2002.09.12 - SLIDE 7
• Label: CREATOR
• The person(s) or organization(s) primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.
IS 202 - FALL 2002 2002.09.12 - SLIDE 8
• Label: SUBJECT
• The topic of the resource, or keywords or phrases that describe the subject or content of the resource. The intent of the specification of this element is to promote the use of controlled vocabularies and keywords. This element might well include scheme-qualified classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or scheme-qualified controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well.
IS 202 - FALL 2002 2002.09.12 - SLIDE 9
• Label: DESCRIPTION
• A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources. Future metadata collections might well include computational content description (spectral analysis of a visual resource, for example) that may not be embeddable in current network systems. In such a case this field might contain a link to such a description rather than the description itself.
IS 202 - FALL 2002 2002.09.12 - SLIDE 10
• Label: PUBLISHER
• The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity. The intent of specifying this field is to identify the entity that provides access to the resource.
IS 202 - FALL 2002 2002.09.12 - SLIDE 11
• Label: CONTRIBUTORS
• Person(s) or organization(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element (for example, editors, transcribers, illustrators, and convenors).
IS 202 - FALL 2002 2002.09.12 - SLIDE 12
• Label: DATE
• The date the resource was made available in its present form. The recommended best practice is an 8-digit number in the form YYYYMMDD as defined by ANSI X3.30-1985. In this scheme, the date element for the day this is written would be 19961203, or December 3, 1996. Many other schemes are possible, but if used, they should be identified in an unambiguous manner.
IS 202 - FALL 2002 2002.09.12 - SLIDE 13
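A minimal sketch of the YYYYMMDD convention described above, using Python's standard `datetime` module. The function names `to_dc_date` and `from_dc_date` are made up for this example.

```python
from datetime import date, datetime

def to_dc_date(d):
    """Format a date as an 8-digit YYYYMMDD string."""
    return d.strftime("%Y%m%d")

def from_dc_date(text):
    """Parse an 8-digit YYYYMMDD string into a date."""
    if len(text) != 8 or not text.isdigit():
        raise ValueError("expected exactly 8 digits: " + repr(text))
    return datetime.strptime(text, "%Y%m%d").date()

print(to_dc_date(date(1996, 12, 3)))
print(from_dc_date("19961203"))
```

The explicit length check is there because the value is meant to be a fixed-width string; a looser parser would accept forms that other systems might not read the same way.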
• Label: RESOURCE TYPE
• The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary. It is expected that RESOURCE TYPE will be chosen from an enumerated list of types. One preliminary set of such types can be found at the following URL (now out of date): http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html
IS 202 - FALL 2002 2002.09.12 - SLIDE 14
• Label: FORMAT
• The data representation of the resource, such as text/html, ASCII, Postscript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with RESOURCE TYPE, FORMAT will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principle, formats can include physical media such as books, serials, or other nonelectronic media.
IS 202 - FALL 2002 2002.09.12 - SLIDE 15
• Label: IDENTIFIER
• String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names would also be candidates for this element.
IS 202 - FALL 2002 2002.09.12 - SLIDE 16
• Label: SOURCE
• The work, either print or electronic, from which this resource is derived, if applicable. For example, an html encoding of a Shakespearean sonnet might identify the paper version of the sonnet from which the electronic version was transcribed.
IS 202 - FALL 2002 2002.09.12 - SLIDE 17
• Label: LANGUAGE
• Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with the Z39.53 three character codes for written languages. See: http://www.sil.org/sgml/nisoLang3-1994.html
IS 202 - FALL 2002 2002.09.12 - SLIDE 18
• Label: RELATION
• Relationship to other resources. The intent of specifying this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves. For example, images in a document, chapters in a book, or items in a collection. A formal specification of RELATION is currently under development. Users and developers should understand that use of this element should currently be considered experimental.
IS 202 - FALL 2002 2002.09.12 - SLIDE 19
• Label: COVERAGE
• The spatial locations and temporal duration characteristic of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element should currently be considered experimental.
IS 202 - FALL 2002 2002.09.12 - SLIDE 20
• Label: RIGHTS
• The content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources. No assumptions should be made by users if such a field is empty or not present.
IS 202 - FALL 2002 2002.09.12 - SLIDE 21
• Lack of guidance on what to put into each element
• How to structure or organize at the element level?
• How to ensure consistency across descriptions for the same persons, places, things, etc.
IS 202 - FALL 2002 2002.09.12 - SLIDE 22
• Structures and languages for the description of information resources and their elements (components or features)
• “Metadata is information on the organization of the data, the various data domains, and the relationship between them” (Baeza-Yates p. 142)
IS 202 - FALL 2002 2002.09.12 - SLIDE 23
• Often two main types of metadata are distinguished:
– Descriptive metadata
• Describes the information/data object and its properties
• May use a variety of descriptive formats and rules
– Topical metadata
• Describes the topic or “aboutness” of an information/data object
• May include a variety of vocabularies for describing subjects, topics, categories, etc.
IS 202 - FALL 2002 2002.09.12 - SLIDE 24
• Review
– Metadata Systems
– Dublin Core
• Controlled Vocabularies
• Name Authority Files
– Choice of Names
– Form of Names
• Other Types of Controlled Vocabularies
• Faceted vs. Hierarchic Organization of Vocabularies
IS 202 - FALL 2002 2002.09.12 - SLIDE 25
• Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information
• That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata
IS 202 - FALL 2002 2002.09.12 - SLIDE 26
• Names and name authorities
• Gazetteers (geographic names)
• Code lists (e.g., LC language codes)
• Subject heading lists
• Classification schemes
• Thesauri
IS 202 - FALL 2002 2002.09.12 - SLIDE 27
• Review
– Metadata Systems
– Dublin Core
• Controlled Vocabularies
• Name Authority Files
– Choice of Names
– Form of Names
• Other Types of Controlled Vocabularies
• Faceted vs. Hierarchic Organization of Vocabularies
IS 202 - FALL 2002 2002.09.12 - SLIDE 28
• Remember Cutter’s objectives of bibliographic description?
– To enable a person to find a document of which the author is known
– To show what the library has by a given author
• First serves access
• Second serves collocation
IS 202 - FALL 2002 2002.09.12 - SLIDE 29
• How many names should be associated with a document?
• Which of these should be the “main entry”?
• What form should each of the names take?
• What references should be made from other possible forms of names that haven’t been used?
IS 202 - FALL 2002 2002.09.12 - SLIDE 30
• Proliferation of the forms of names
– Different names for the same person
– Different people with the same names
• Examples
– from Books in Print (semi-controlled but not consistent)
– ERIC author index (not controlled)
IS 202 - FALL 2002 2002.09.12 - SLIDE 31
• AACR II and other sets of descriptive cataloging rules provide guidelines for:
– Determining the number of name entries
– Choosing a main entry
– Deciding on the form of name to be used
– Deciding when to make references
IS 202 - FALL 2002 2002.09.12 - SLIDE 36
• Authority control is concerned with the creation and maintenance of a set of terms that have been chosen as the standard representatives (also known as established) based on some set of rules
• If you have rules, why do you need to keep track of all of the headings? Can’t you just infer the headings from the rules?
IS 202 - FALL 2002 2002.09.12 - SLIDE 37
• Single person or single corporate entity
• Unknown or anonymous authors
– Fictitiously ascribed works
• Shared responsibility
• Collections or editorially assembled works
• Works of mixed responsibility (e.g., translations)
• Related works
IS 202 - FALL 2002 2002.09.12 - SLIDE 38
• Personal names
– Collaborators
– Editors, compilers, writers
– Translators (in some cases)
– Illustrators (in some cases)
– Other persons associated with the work (such as the honoree in a festschrift)
• Corporate names
– Any prominently named corporate body that has involvement in the work beyond publication, distribution, etc.
IS 202 - FALL 2002 2002.09.12 - SLIDE 39
• AACR II says that the predominant form of the name used in a particular author’s writings should be chosen as the form of name
• References should be made from the other forms of the name
IS 202 - FALL 2002 2002.09.12 - SLIDE 40
• When names appear in multiple forms, one form needs to be chosen
• Criteria for choice are:
– Fullness (e.g., full names vs. initials only)
– Language of the name
– Spelling (choose predominant form)
• Entry element:
– John Smith or Smith, John?
– Mao Zedong or Zedong, Mao? (Mao Tse Tung?)
IS 202 - FALL 2002 2002.09.12 - SLIDE 41
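The entry-element question above can be made concrete with a deliberately naive sketch. The `entry_form` function and its `family_name_first` flag are hypothetical and are not taken from AACR II; the point is only that a mechanical rule needs extra knowledge about the name before it can pick the right entry element.

```python
def entry_form(name, family_name_first=False):
    """Put the family name first, followed by a comma and the rest."""
    parts = name.split()
    if len(parts) < 2:
        return name  # single-word names are left alone
    if family_name_first:
        family, given = parts[0], parts[1:]
    else:
        family, given = parts[-1], parts[:-1]
    return family + ", " + " ".join(given)

print(entry_form("John Smith"))
print(entry_form("Mao Zedong"))                          # naive rule picks the wrong element
print(entry_form("Mao Zedong", family_name_first=True))
```

Even with the flag, the function says nothing about which romanization or which fullness of the name to prefer, which is why a rule alone does not settle the form of a heading.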
ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-21-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973
Different names for the same person
IS 202 - FALL 2002 2002.09.12 - SLIDE 42
ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-19-91
040 OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)
IS 202 - FALL 2002 2002.09.12 - SLIDE 43
ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 06-06-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)
Different people writing with the same name
IS 202 - FALL 2002 2002.09.12 - SLIDE 44
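The three authority records above can be sketched as a small lookup table. This is a simplified illustration, not how the Library of Congress file is implemented: only a few of the 400 (see from) references are copied in, subfield codes are flattened into plain strings, and the `strip_dates` and `established_headings` helpers are hypothetical.

```python
# Each record has one established heading (MARC field 100) and a list
# of "see from" variant forms (field 400), abridged from the slides.
AUTHORITY_RECORDS = [
    {"heading": "Creasey, John",
     "see_from": ["Cooke, M. E.", "Cooke, Margaret, 1908-1973",
                  "Gill, Patrick, 1908-1973", "Wilde, Jimmy"]},
    {"heading": "Marric, J. J., 1908-1973", "see_from": []},
    {"heading": "Butler, William Vivian, 1927-",
     "see_from": ["Butler, W. V. (William Vivian), 1927-",
                  "Marric, J. J., 1927-"]},
]

def strip_dates(name):
    """Drop a trailing ', 1908-1973' or ', 1927-' style date qualifier."""
    head, sep, tail = name.rpartition(", ")
    return head if sep and tail[:1].isdigit() else name

def established_headings(name):
    """Return every established heading that a name form could refer to."""
    wanted = strip_dates(name)
    hits = []
    for rec in AUTHORITY_RECORDS:
        forms = [rec["heading"]] + rec["see_from"]
        if any(strip_dates(form) == wanted for form in forms):
            hits.append(rec["heading"])
    return hits

print(established_headings("Wilde, Jimmy"))
print(established_headings("Marric, J. J."))
```

The first lookup resolves one of many variant names to a single heading. The second returns two headings, because without dates "Marric, J. J." could be either of two people, and only a maintained file of headings can record that.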
1. Paine, Lauran.
ALSO KNOWN AS:
Carrel, Mark.
Thompson, Russ.
Andrews, A. A.
Benton, Will.
Bradford, Will.
Bradley, Concho.
Brennan, Will.
Carter, Nevada.
Allen, Clay.
Almonte, Rosa.
Armour, John.
Cassady, Claude.
Glendenning, Donn.
Kelley, Ray.
Kilgore, John.
Martin, Tom.
Slaughter, Jim.
Standish, Buck.
…
IS 202 - FALL 2002
Batchelor, Reg.
Beck, Harry.
Bedford, Kenneth.
Bosworth, Frank.
Bovee, Ruth.
Cassidy, Claude.
Custer, Clint.
Dana, Amber.
Dana, Richard.
Davis, Audrey.
Drexler, J. F.
Duchesne, Antoinette.
Fisher, Margot.
Fleck, Betty.
Frost, Joni.
Gordon, Angela.
Gorman, Beth.
Hayden, Jay.
Houston, Will.
Howard, Troy.
Ingersol, Jared.
…
Kelly, Ray.
Ketchum, Jack.
Liggett, Hunter.
Lucas, J. K.
Lyon, Buck.
Morgan, Arlene.
Morgan, Valerie.
O'Connor, Clint.
St. George, Arthur.
Sharp, Helen.
Thorn, Barbara.
Archer, Dennis.
Clark, Badger.
2002.09.12 - SLIDE 45
• Review
– Dublin Core
– Other Metadata Systems
• Controlled Vocabularies
• Name Authority Files
– Choice of Names
– Form of Names
• Other Types of Controlled Vocabularies
• Faceted vs. Hierarchic Organization of Vocabularies
IS 202 - FALL 2002 2002.09.12 - SLIDE 47
[Figure: structure of an information storage and retrieval system. Search line: interest profiles and queries are formulated in terms of descriptors and stored (Store 1: profiles/search requests). Storage line: documents and data are indexed (descriptive and subject) and stored (Store 2: document representations). Both lines follow the "rules of the game" (rules for subject indexing plus a thesaurus, which consists of a lead-in vocabulary and an indexing language). Comparison/matching of the two stores yields potentially relevant documents. Adapted from Soergel, p. 19]
2002.09.12 - SLIDE 48 IS 202 - FALL 2002
• Library subject headings, classification, and authority files
• Commercial journal indexing services and databases
• Yahoo, and other web classification schemes
• Online and manual systems within organizations
– SunSolve
– MacArthur
IS 202 - FALL 2002 2002.09.12 - SLIDE 49
• Uncontrolled keyword indexing
• Indexing languages
– Controlled, but not structured
• Thesauri
– Controlled and structured
• Classification systems
– Controlled, structured, and coded
• Faceted thesauri and classification systems
IS 202 - FALL 2002 2002.09.12 - SLIDE 50
• An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents
• An indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms
IS 202 - FALL 2002 2002.09.12 - SLIDE 51
• Library of Congress Subject Headings
• Yellow pages topics
• Wilson indexes (“Readers’ Guide”)
IS 202 - FALL 2002 2002.09.12 - SLIDE 52
• A thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among
– Synonymous
– Equivalent
– Broader
– Narrower, and
– Other related terms
IS 202 - FALL 2002 2002.09.12 - SLIDE 53
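The links listed above can be sketched as a small data structure. The terms below are borrowed from the ERIC examples later in this lecture, but the relationships between them are invented for illustration and are not taken from the ERIC Thesaurus. UF (used for), BT (broader term), NT (narrower term), and RT (related term) are the conventional abbreviations; the helper names are hypothetical.

```python
# Hypothetical mini-thesaurus; the relationships are illustrative only.
THESAURUS = {
    "Athletics": {"UF": ["Sports"], "BT": [],
                  "NT": ["Athletic Equipment", "Athletic Fields"],
                  "RT": ["Athletes"]},
    "Athletic Equipment": {"UF": ["Sports Gear"], "BT": ["Athletics"],
                           "NT": [], "RT": []},
    "Athletic Fields": {"UF": [], "BT": ["Athletics"], "NT": [], "RT": []},
    "Athletes": {"UF": ["Sportspeople"], "BT": [], "NT": [],
                 "RT": ["Athletics"]},
}

def preferred_term(word):
    """Map a word from the lead-in vocabulary to its descriptor, if any."""
    for term, links in THESAURUS.items():
        if word == term or word in links["UF"]:
            return term
    return None

def with_narrower(term):
    """Return the term followed by all of its narrower terms."""
    result = [term]
    for child in THESAURUS[term]["NT"]:
        result.extend(with_narrower(child))
    return result

print(preferred_term("Sports"))
print(with_narrower("Athletics"))
```

The same structure serves both the indexer, who is steered from a synonym to the preferred descriptor, and the searcher, who can broaden a query to include narrower terms.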
• National and international standards for thesauri
– ANSI/NISO Z39.19 -- 1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
– ANSI/NISO Draft Standard Z39.4-199x -- American National Standard Guidelines for Indexes in Information Retrieval
– ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri
– ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri
IS 202 - FALL 2002 2002.09.12 - SLIDE 54
• Examples:
– The ERIC Thesaurus of Descriptors
– The Art and Architecture Thesaurus
– The Medical Subject Headings (MeSH) of the National Library of Medicine
IS 202 - FALL 2002 2002.09.12 - SLIDE 55
• A classification system is an indexing language often based on a broad ordering of topical areas
• Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics
• Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms
IS 202 - FALL 2002 2002.09.12 - SLIDE 56
• Examples:
– The Library of Congress Classification System
– The Dewey Decimal Classification System
– The ACM Computing Reviews Categories
– The American Mathematical Society Classification System
IS 202 - FALL 2002 2002.09.12 - SLIDE 57
• Start with the text of the document
• Attempt to “control” or regularize:
– The concepts expressed within
• mutually exclusive
• exhaustive
– The language used to express those concepts
• limit the normal linguistic variations
• regulate word order and structure of phrases
• reduce the number of synonyms or near-synonyms
• Also, provide cross-references between concepts and their expression
(These slides follow Bates 88)
Slide author: Marti Hearst
IS 202 - FALL 2002 2002.09.12 - SLIDE 58
• Classify possible concepts.
• Goals:
– Completely distinct conceptual categories
(mutually exclusive)
– Complete coverage of conceptual categories
(exhaustive)
Slide author: Marti Hearst
2002.09.12 - SLIDE 59 IS 202 - FALL 2002
Assigning Headings vs. Descriptors
• Subject headings
– Assign one (or a few) complex heading(s) to the document
• Descriptors
– Mix and match
How would we describe recipes using each technique?
Slide author: Marti Hearst
2002.09.12 - SLIDE 60 IS 202 - FALL 2002
• Wilsonline
– Athletes
– Athletes -- Health & hygiene
– Athletes -- Nutrition
– Athletes -- Physical Exams
– …
– Athletics
– Athletics -- Administration
– Athletics -- Equipment -- Catalogs
– …
– Sports -- Accidents and Injuries
– Sports -- Accidents and Injuries -- Prevention
• ERIC
– Athletes
– Athletic Coaches
– Athletic Equipment
– Athletic Fields
– Athletics
– …
– Sports Psychology
– Sportsmanship
Slide author: Marti Hearst
2002.09.12 - SLIDE 61 IS 202 - FALL 2002
• Subject headings
– Describe the contents of an entire document
– Designed to be looked up in an alphabetical index
• Look up document under its heading
– Few (1-5) headings per document
• Descriptors
– Describe one concept within a document
– Designed to be used in Boolean searching
• Combine to describe the desired document
– Many (5-25) descriptors per document
Slide author: Marti Hearst
2002.09.12 - SLIDE 62 IS 202 - FALL 2002
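One way to picture the contrast above is with the recipe question from the earlier slide. The sketch below indexes three imaginary recipes twice: once with descriptors that are combined at search time, and once with one complex heading each. All recipe names, terms, headings, and helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical recipe collection, indexed two ways.
DESCRIPTORS = {
    "recipe1": ["Chicken", "Grilling", "Main Dishes", "Summer"],
    "recipe2": ["Chicken", "Soups", "Winter"],
    "recipe3": ["Tofu", "Grilling", "Main Dishes", "Vegetarian"],
}
HEADINGS = {
    "recipe1": ["Cooking -- Chicken -- Grilling"],
    "recipe2": ["Cooking -- Chicken -- Soups"],
    "recipe3": ["Cooking -- Tofu -- Grilling"],
}

def build_index(assignments):
    """Invert {document: terms} into {term: set of documents}."""
    index = defaultdict(set)
    for doc_id, terms in assignments.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return the documents that carry every one of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

descriptor_index = build_index(DESCRIPTORS)
heading_index = build_index(HEADINGS)

# Descriptors are combined at search time.
print(sorted(boolean_and(descriptor_index, "Grilling", "Main Dishes")))

# Subject headings are browsed as a single alphabetical list.
for heading in sorted(heading_index):
    print(heading, sorted(heading_index[heading]))
```

With headings, "everything grilled" is scattered under different main headings; with descriptors, it is a single term lookup, at the cost of assigning more terms per document.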
• Review
– Dublin Core
– Other Metadata Systems
• Controlled Vocabularies
• Name Authority Files
– Choice of Names
– Form of Names
• Other Types of Controlled Vocabularies
• Faceted vs. Hierarchic Organization of Vocabularies
IS 202 - FALL 2002 2002.09.12 - SLIDE 63
• Each category is successively broken down into smaller and smaller subdivisions
• No item occurs in more than one subdivision
• Each level divided out by a “character of division” (also known as a feature)
– Example:
• Distinguish “Literature” based on:
– Language
– Genre
– Time Period
Slide author: Marti Hearst
2002.09.12 - SLIDE 64 IS 202 - FALL 2002
[Figure: hierarchical classification tree. Literature is divided by language (French, English, Spanish, ...), each language by genre (Prose, Poetry, Drama, ...), and each genre by period (16th, 17th, 18th, 19th, ...).]
Slide author: Marti Hearst
2002.09.12 - SLIDE 65 IS 202 - FALL 2002
Labeled Categories for Hierarchical Classification
• LITERATURE
– 100 English Literature
• 110 English Prose
– English Prose 16th Century
– English Prose 17th Century
– English Prose 18th Century
– ...
• 111 English Poetry
– 121 English Poetry 16th Century
– 122 English Poetry 17th Century
– ...
• 112 English Drama
– 130 English Drama 16th Century
– …
– 200 French Literature
IS 202 - FALL 2002
Slide author: Marti Hearst
2002.09.12 - SLIDE 66
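A hierarchical scheme like the one above can be sketched as a nested structure built by applying the characters of division in a fixed order. This is only an illustration with hypothetical helper names; it leaves out the numeric notation and uses just the languages, genres, and periods named on the slides.

```python
LANGUAGES = ["English", "French", "Spanish"]
GENRES = ["Prose", "Poetry", "Drama"]
PERIODS = ["16th Century", "17th Century", "18th Century", "19th Century"]

def build_hierarchy():
    """Divide by language, then genre, then period (a fixed order)."""
    return {lang: {genre: list(PERIODS) for genre in GENRES}
            for lang in LANGUAGES}

def leaves(tree, path=()):
    """Yield every root-to-leaf path as a tuple of labels."""
    if isinstance(tree, dict):
        for label, subtree in tree.items():
            yield from leaves(subtree, path + (label,))
    else:
        for label in tree:
            yield path + (label,)

tree = build_hierarchy()
all_paths = list(leaves(tree))
print(len(all_paths))   # 3 languages x 3 genres x 4 periods = 36

# Everything from the 16th century is spread across nine branches.
sixteenth = [p for p in all_paths if p[-1] == "16th Century"]
print(len(sixteenth))   # 9
```

Each item has exactly one place, but that place depends on the order of division: a question that starts from the last characteristic has to visit every branch above it.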
• Create a separate, free-standing list for each characteristic or division (feature)
• Combine features to create a classification
Slide author: Marti Hearst
2002.09.12 - SLIDE 67 IS 202 - FALL 2002
Faceted Classification Along With Labeled Categories
• A Language
– a English
– b French
– c Spanish
• B Genre
– a Prose
– b Poetry
– c Drama
• C Period
– a 16th Century
– b 17th Century
– c 18th Century
– d 19th Century
• Aa English Literature
• AaBa English Prose
• AaBaCa English Prose 16th Century
• AbBbCd French Poetry 19th Century
• BcCd Drama 19th Century
Slide author: Marti Hearst
2002.09.12 - SLIDE 68 IS 202 - FALL 2002
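The faceted notation above can be sketched as three free-standing lists plus two small functions that combine and split the codes. The `encode` and `decode` helpers are hypothetical; the facet letters and values are the ones on the slide.

```python
# One free-standing list per facet: letter -> (facet name, value codes).
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}

def decode(notation):
    """Split a notation such as 'AaBaCa' into its facet values."""
    if len(notation) % 2:
        raise ValueError("notation must be facet/value letter pairs")
    values = []
    for i in range(0, len(notation), 2):
        facet, code = notation[i], notation[i + 1]
        values.append(FACETS[facet][1][code])
    return values

def encode(**chosen):
    """Build a notation from keyword arguments such as genre='Poetry'."""
    parts = []
    for facet, (name, codes) in FACETS.items():
        value = chosen.get(name.lower())
        if value is not None:
            code = next(c for c, label in codes.items() if label == value)
            parts.append(facet + code)
    return "".join(parts)

print(decode("AaBaCa"))
print(encode(language="French", genre="Poetry", period="19th Century"))
```

Because each facet is independent, a class may leave facets out or be queried on any single facet, which a strict hierarchy cannot do without visiting every branch.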
• How to use both types of classification structures?
• How to look through them?
• How to use them in search?
Slide author: Marti Hearst
2002.09.12 - SLIDE 69 IS 202 - FALL 2002
• Multimedia Information Organization and Retrieval (MED)
• Readings for next time (in Protected)
– “Indexing the Content of Multimedia Documents” (S. W. Smoliar, L. D. Wilcox)
– “Computational Media Aesthetics: Finding Meaning Beautiful” (C. Dorai, S. Venkatesh)
– “The Holy Grail of Content-Based Media Analysis” (S. Chang)
IS 202 - FALL 2002 2002.09.12 - SLIDE 70
• Do Readings
• Receive and integrate feedback on Assignment 2 to iterate your Photo Use Scenario (nothing to turn in on this yet)
• Assignment 3: Photo Metadata Design
– Due by Thursday, September 19
IS 202 - FALL 2002 2002.09.12 - SLIDE 71