Recap
• We looked at the indexing process to see how controlled vocabularies can be used to enhance access to information
  – Different methods of indexing provide different results
  – Need to decide on your approach based on an analysis of your business objectives, the user needs, and the domain
  – A combination of automatic and human indexing is often the best solution

IMT530- Organization of Information Resources 1

Module 6a: Intro to Controlled Vocabularies, Taxonomies and Classification
IMT530: Organization of Information Resources
Winter 2008
Michael Crandall

Module 6a Outline
• Where we are
• Controlled vocabularies
• Types of controlled vocabularies
• Tagging
• Overview of building vocabularies

Overview of Subject Representation
• Subject analysis – a technique used to determine the “subject(s)” and disciplinary context exemplified by an object
• Subject indexing – a technique through which subject terms (words, taxonomic categories, or notation) are added to an object representation to describe the subject content of the object
• Controlled vocabularies – standards containing controlled subject terms (words, taxonomic categories, or notation) used in the indexing process

Controlled Vocabulary: Definition
• A controlled vocabulary is a list of terms (words or phrases) or codes (notation) used for indexing
• Almost always, controlled vocabularies show relationships among terms

Purpose of Controlled Vocabularies
• Specific Purposes
  – Provide access to content by subject, through hierarchical and associative relationships and synonym control for the terms used in the domain
  – Increase precision in retrieval and display by controlling homographs (words that are spelled the same but have different meanings)
• General Purposes
  – Assist users by conveying meaning, orientation, and structure in a subject area
  – Assist users by providing rich relationships among concepts and terms

Buckland
• Proposes five different vocabularies in any system:
  – Authors
  – Indexers
  – Syndetic structure
  – Searchers
  – Formulated queries
• Formal tradition vs. document tradition

Types of Controlled Vocabularies
Zeng, M.L. (2005). Construction of controlled vocabularies: A primer.
• Subject Heading List
• Taxonomy
• Thesauri
• Classification Scheme
• More terminology on Leonard Will’s site
  – http://www.willpowerinfo.co.uk/glossary.htm

Subject Heading Lists
• General list of terms (words and phrases), not limited by discipline or subject area
• Terms are called subject headings
• The distinction between thesauri and subject heading lists is largely historical (subject heading lists are older); there are very few subject heading lists because they are so expensive to maintain
• Terms are mainly subject attributes, but there are many exemplified attributes used in subdivisions
• Example: Library of Congress Subject Headings (LCSH), used in library catalogs
  – Sample terms: “France – Colonies – History – 18th century”; “Time and space – Juvenile fiction”; “Frogs” (notice the use of subdivisions, marked here by dashes; thesauri seldom use subdivisions)

Taxonomies
• List of terms (words and phrases) that may be general or subject/discipline/domain specific
• Terms are called taxons or (simply) terms
• Terms represent subjects, disciplines/domains, and exemplified attributes
• Used in the digital environment only
• Examples: Microsoft Corporation intranet taxonomies; Yahoo taxonomy used in the Yahoo directory
  – Sample terms from the Yahoo taxonomy (in Yahoo, you’ll find these at the top of the screen as you browse through the directory): “Education”; “Science > Agriculture > Research > Government Agencies”; “Health > Nursing”; “Health > Education”

Thesauri
• Thesauri (pl.) / Thesaurus (s.)
  – List of terms (words and phrases) that are usually limited to a specific subject or disciplinary area
  – Terms listed in a thesaurus are often called descriptors
  – Thesauri were mostly defined and developed after the advent of the computer and were created for use in a computerized environment (or with computers in mind)
  – Terms are usually subject (about) attributes, but some thesauri also contain exemplified (example of) attributes (see http://www.e-government.govt.nz/nzgls/thesauri)
  – Example: ERIC Thesaurus (education)
    • Sample terms from the ERIC Thesaurus: “School community relationship”; “College entrance exams”; “Age grade placement”

“Classification” Schemes
• Chart of subject categories contextualized by a hierarchical structure
• Terms are lists of codes (notation)
• Terms are called classes and class numbers
• Classification schemes make use of disciplinary, subject, and (sometimes) exemplified attributes
• Often used to arrange physical documents; sometimes used in online environments

“Classification” Example
• Examples: Dewey Decimal Classification (DDC); Universal Decimal Classification (UDC); Colon Classification
• Sample entries (DDC):
  – 510 (meaning: “Mathematics” (a discipline and a subject))
  – 512.57 (meaning: “Mathematics / Linear, multilinear, multidimensional algebras / Factor algebras”)
  – 362.582 (meaning: “Social problems and services / Problems of and services to the poor / Financial assistance”)

Four Types of Classification
• Kwasnik describes four classification systems
  – Hierarchies
  – Trees
  – Paradigms
  – Facets
• Paradigms are useful primarily for analysis of subject gaps and relationships in a constrained space
• Trees are a poor form of hierarchy with limited relationships
• We’ll look at the other two (hierarchies and facets) in some detail over the next two weeks

Hierarchies
• Good for representation of knowledge in mature domains where the nature of the entities and relationships is well known
• You’ll see examples of these in the thesauri that we will look at in today’s exercise
• Require a model that describes what entities are included, with rules of association and distinction
• Tend to be monolithic and cumbersome for large domains

Facets
• Actually a different approach rather than a different structure
  – May use hierarchies or trees as part of the structure
  – Originated in the work of S.R. Ranganathan
    • Proposed that any object could be viewed in five ways: personality, matter, energy, space and time (PMEST)
  – Being used more and more in modern information systems because of flexibility in meeting multiple needs

Collaborative Tagging
• Golder and Huberman point out issues of “basic level” and “collective sensemaking”
• Tug of war between personal storage
  – Identifying qualities
  – Self reference
  – Task organizing
• and public nature of access
  – What or who it is about
  – What it is
  – Who owns it
  – Categories
• Stability emerges from imitation and shared experience

Trees vs. Tags
• Weinberger’s article postulates three types of vocabularies
  – Trees (hierarchies)
  – Facets
  – Tags
• Golder/Huberman and Weinberger both point out that each approach can be useful in particular situations
  – Choosing your approach is part of the process of subject and domain analysis

Steps in Constructing CVs
• Define your domain
• Gather concepts
  – From user interviews, search logs, content analysis, preexisting vocabularies
• Select your approach
• Extract terminology
• Control your terms
• Organize your terms
• Maintain, maintain, maintain

Questions?
• If not, take a break!!!

Exercise 6a
• Purpose is to explore some existing controlled vocabularies to investigate their differences and similarities, how useful they might be for subject access, and to become familiar with the structure of controlled vocabularies in general
• Spend the next 45 minutes on Exercise 6a
• Ask questions and talk!!!
• Be sure to hand in completed work at the end of class for credit!!!

Next Week
• We’ll start to look at ways to build controlled vocabularies and the rules associated with them
• Remember to read assignments BEFORE class
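
Illustration: a controlled vocabulary as a data structure

The relationships named under "Purpose of Controlled Vocabularies" (hierarchical and associative relationships, synonym control, homograph control) can be sketched in a few lines of Python. This is only a minimal illustration, not how any real thesaurus system is implemented. The class names (`Term`, `ControlledVocabulary`), the method names, and the sample terms other than "Frogs" are assumptions made up for the example; homographs are disambiguated here with parenthetical qualifiers, which is one common convention.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A preferred term with its relationships (hypothetical structure)."""
    label: str
    broader: set = field(default_factory=set)   # hierarchical: parent terms
    narrower: set = field(default_factory=set)  # hierarchical: child terms
    related: set = field(default_factory=set)   # associative relationships


class ControlledVocabulary:
    def __init__(self):
        self.terms = {}  # preferred label -> Term
        self.use = {}    # lowercased entry term -> preferred label

    def add_term(self, label, broader=None, synonyms=()):
        term = self.terms.setdefault(label, Term(label))
        self.use[label.lower()] = label
        for synonym in synonyms:
            # Synonym control: every entry term points to one preferred term.
            self.use[synonym.lower()] = label
        if broader is not None:
            parent = self.terms.setdefault(broader, Term(broader))
            self.use.setdefault(broader.lower(), broader)
            term.broader.add(broader)
            parent.narrower.add(label)
        return term

    def relate(self, a, b):
        # Associative relationships are symmetric.
        self.terms[a].related.add(b)
        self.terms[b].related.add(a)

    def preferred(self, entry):
        """Map a searcher's or indexer's word to the preferred term, if any."""
        return self.use.get(entry.strip().lower())

    def expand(self, label):
        """Return the term followed by all of its narrower terms, transitively."""
        result, seen, stack = [], set(), [label]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            result.append(current)
            stack.extend(sorted(self.terms[current].narrower, reverse=True))
        return result


# Illustrative sample data (not taken from any published vocabulary).
cv = ControlledVocabulary()
cv.add_term("Animals")
cv.add_term("Amphibians", broader="Animals")
cv.add_term("Frogs", broader="Amphibians", synonyms=["Anura", "Frog"])
cv.add_term("Toads", broader="Amphibians")
cv.relate("Frogs", "Toads")
# Homograph control: the bare word "Mercury" is not a term; qualifiers are required.
cv.add_term("Mercury (planet)")
cv.add_term("Mercury (element)", synonyms=["Quicksilver"])

print(cv.preferred("anura"))    # Frogs
print(cv.preferred("Mercury"))  # None
print(cv.expand("Animals"))     # ['Animals', 'Amphibians', 'Frogs', 'Toads']
```

A search for "Animals" that is expanded this way also retrieves items indexed under "Frogs" or "Toads", which is the kind of subject access a hierarchy provides. A lookup of the unqualified "Mercury" returns nothing, which forces the searcher or indexer to choose between the two meanings.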