Implementing a Taxonomy in a Content Management Portal
Content Week 2005, Miami, Florida
Monday, January 31, 2005
Workshop H, 2:45 pm – 4:45 pm

Marjorie M.K. Hlava
Access Innovations, Inc.
505-998-0800
mhlava@accessinn.com
www.accessinn.com

Introductions
• Name
• Project
• Expectations for these two short hours
• Please fill in the sign up sheet
• Would you like
  – 1. Copy of this presentation?
  – 2. Sample software?
  – 3. Other information?

What will we talk about this afternoon?
• 1. Definitions
• 2. Where taxonomy fits in the Information Circle
• 3. Where to use a taxonomy
• 4. Taxonomies for Communities of Practice
• 5. Surrounding theories and applications
• 6. How to build and maintain
• 7. How it is used in enterprise information

Copyright © 2005 Access Innovations, Inc.

Implementing a Taxonomy in a Content Management Portal
[Diagram labels: Thesaurus Master; Database Management System; Add Metadata using MAI; Data Feed; MAI to add Metadata; Inverted File]

1. Definitions

What is a taxonomy?
• A hierarchical thesaurus with authority terms applied at the final node
• A browse-able web interface
• A Linnaean System
• A browse-able list with the term instance at the final leaf

Types of Taxonomies
• Naming and organizing things into groups that share similar characteristics
• 1. Flat – just a list
• 2. Hierarchical – Taxonomic view
• 3. Faceted
  – Sorted by a single characteristic
  – Metadata - Dublin Core
  – COSATI - GILS
• 4. Thesaurus
  – Term records
  – Database backend
  – Easier to modify and maintain

Taxonomy in meta data
• Definition
  – Taxonomy is a thesaurus in its hierarchical view with the authority files applied at the final nodes
  – It allows a browse-able front end to a portal
  – It provides keyword and name access to the content in the portal
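
Illustration (not part of the original slides): the definition above, a hierarchical view of concepts with authority terms at the final nodes, can be sketched as a small tree. This is a minimal sketch only. The `Node` class, the `outline` and `final_nodes` helpers, and the labels "Transportation", "Wagons", and "Automobiles" are assumptions made for the example; only the museum artifact label is taken from the deck.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node in the hierarchical (browse) view of a taxonomy."""
    label: str
    kind: str = "concept"  # "concept" (thesaurus term) or "authority" (person, place, thing)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child


def outline(node: Node, depth: int = 0) -> List[str]:
    """Return the hierarchy as indented lines, like a portal browse list."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


def final_nodes(node: Node) -> List[Node]:
    """Collect the last positions on the tree (nodes without children)."""
    if not node.children:
        return [node]
    leaves: List[Node] = []
    for child in node.children:
        leaves.extend(final_nodes(child))
    return leaves


# Hypothetical sample data; the concept labels are invented for illustration.
root = Node("Transportation")
wagons = root.add(Node("Wagons"))
wagons.add(Node("Museum artifact 1706 wooden wagon wheel", kind="authority"))
root.add(Node("Automobiles"))

print("\n".join(outline(root)))
```

Keeping the node `kind` separate from its position lets a browse page style concept nodes and authority nodes differently while walking a single tree.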
Taxonomy definition
• A taxonomy is a thesaurus in hierarchical view with authority file terms added at the final nodes
• Thesaurus
• Authority file
• Hierarchical form
• Final nodes

Thesaurus
• Concepts
• Methods
• Procedures
• Cognitive approach
• The knowledge capture piece
• The topics or subjects

Authority file
• People
• Places
• Things
• The tangible approach
• Concrete entities

Hierarchical view
• Gives the Portal view
• The view of all the preferred terms in categorized order
• An outline of the thesaurus

Final Nodes
• The last position on the hierarchical tree
  – Taxonomy
    • concept
      – narrower terms
        » final node - people, place or thing term
        » document instance
        » Letter to George Wiesman Dec 12, 2003
        » Technical report number TR-1039
        » Museum artifact 1706 wooden wagon wheel

Term Records – the Database Part
• Associative terms
  – Related terms
• Equivalence terms
  – Preferred and non-preferred
  – Use and used for
  – Synonyms
• Hierarchical terms
  – Broader and narrower terms
  – Parent and child

Other term record fields
• Scope notes
• Cross references
• History
• Term status
• Category
• User defined

2. Where does a taxonomy fit in the information circle?

Information Circle - Overview
[Diagram: Content, Taxonomy, Output, User]

Content
• Web Pages
• White Papers
• Research Reports
• Licensed Data Feeds
• Intranet
• Internal Reports
• Lotus Notes files
• Databases
• Public Relations Documents/Press Releases
• Market Research Reports
• Customer Relationship Management (CRM)
• HR Files
• Accounting/Financial Records
• Legal Documents
• Patents
• Museum artifacts
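
Illustration (not part of the original slides): the term record relationships listed on the "Term Records" slide (hierarchical, associative, equivalence) can be sketched as a small in-memory structure. This is a sketch under stated assumptions, not how Thesaurus Master or any other product stores its records. The `TermRecord` and `Thesaurus` names and the sample terms "Vehicles" and "Highways" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TermRecord:
    """One preferred term with the relationship fields named on the slide."""
    term: str
    broader: Set[str] = field(default_factory=set)    # BT, parent
    narrower: Set[str] = field(default_factory=set)   # NT, child
    related: Set[str] = field(default_factory=set)    # RT, associative
    used_for: Set[str] = field(default_factory=set)   # UF, non-preferred synonyms
    scope_note: str = ""                               # SN


class Thesaurus:
    """Keeps reciprocal relationships in sync whenever one side is added."""

    def __init__(self) -> None:
        self.records: Dict[str, TermRecord] = {}
        self.use: Dict[str, str] = {}  # non-preferred term -> preferred term

    def record(self, term: str) -> TermRecord:
        if term not in self.records:
            self.records[term] = TermRecord(term)
        return self.records[term]

    def add_hierarchy(self, broader: str, narrower: str) -> None:
        # BT on one record implies NT on the other.
        self.record(broader).narrower.add(narrower)
        self.record(narrower).broader.add(broader)

    def add_related(self, a: str, b: str) -> None:
        # RT is symmetric.
        self.record(a).related.add(b)
        self.record(b).related.add(a)

    def add_synonym(self, preferred: str, non_preferred: str) -> None:
        # "Used for" on the preferred term, "use" pointer from the synonym.
        self.record(preferred).used_for.add(non_preferred)
        self.use[non_preferred] = preferred


# Hypothetical sample data for illustration only.
thesaurus = Thesaurus()
thesaurus.add_hierarchy("Vehicles", "Automobiles")
thesaurus.add_related("Automobiles", "Highways")
thesaurus.add_synonym("Automobiles", "Cars")
```

Storing both directions of each relationship is one reason a database-backed thesaurus is easier to maintain than a hand-edited outline: a single edit updates every affected record.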
Content – cont’d
• Content Creation:
  – HTML – Meta name / Keywords
  – DB – Field / Meta tag / Element
  – XML – Entity table for valid values

Taxonomy
• Taxonomy is applied to new and existing content:
  – Meta Tags
  – Rule Base
  – Thesaurus Terms
  – Authority Terms
  – Date
  – Author
  – Description
  – etc.

Taxonomy – cont’d
• Index data
  – Manually
  – Automatically
• Suggest new candidate terms
• Review

Output
• Searchable Data
  – Internal Data
  – External Data

User
• Web Browsing/Searching
• Database Browsing/Searching
• Query Resolution

User – cont’d
• User Input
  – Suggested Candidate Terms
  – New Documents
• Reports Based on User Search
  – Search Logs
  – Null Hits
  (These will also suggest new candidate terms)

New Content
• The cycle begins again

Information Circle - Overview
[Diagram: Content, Taxonomy, Output, User]

3. Where to use a taxonomy
• Link the Taxonomy and Indexing
• Always in sync with the industry
• Keep up to date with terminology
• Automatically index the old data
• Filter newsfeeds
• Search using the Taxonomy
• File using the taxonomy
• Spell check using the taxonomy
• Link to translation system
• Catalog using the taxonomy
• Index a book

[Screenshot: Thesaurus Master]
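
Illustration (not part of the original slides): the "User – cont’d" slide notes that search logs and null hits suggest new candidate terms. One possible way to produce such a report is sketched below. The log format (pairs of query text and hit count), the `candidate_terms` function, and the `min_count` threshold are assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def candidate_terms(
    search_log: Iterable[Tuple[str, int]],
    vocabulary: Set[str],
    min_count: int = 2,
) -> List[Tuple[str, int]]:
    """Rank null-hit queries that are not yet in the vocabulary.

    search_log holds (query, hit_count) pairs; vocabulary holds every term the
    taxonomy already knows, preferred or non-preferred.
    """
    known = {term.lower() for term in vocabulary}
    counts = Counter(
        query.strip().lower()
        for query, hits in search_log
        if hits == 0 and query.strip().lower() not in known
    )
    # Keep only queries seen often enough to be worth an editor's review.
    return [(query, n) for query, n in counts.most_common() if n >= min_count]


# Hypothetical log entries for illustration only.
log = [
    ("hybrid cars", 0),
    ("Hybrid Cars", 0),
    ("automobiles", 12),
    ("cars", 0),
    ("hybrid cars ", 0),
    ("segway", 0),
    ("segway", 0),
]
report = candidate_terms(log, vocabulary={"Automobiles", "Cars"})
```

The output is only a list of suggestions. As the slides say, candidate terms still go through review before they enter the taxonomy.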
Database Management System - Add Metadata using MAI
[Diagram labels: Database records, each with many elements; Record locator Accessinn.com/12345/demofile/recid15; Inverted File (Aardvark, Alligator, Apple, Advantage, …, Zebra); Portal Searching]

Many databases can be reached
[Diagram repeated: database records with record locators, the inverted file, and portal searching]

4. Taxonomies for Communities of Practice

Taxonomies in a Community of Practice
• Nature of Communities of Practice (CoP)
• Taxonomies in context
• Value of taxonomies
• Creating a taxonomy
• Applying the taxonomy

Nature of CoPs
• Free flowing, loosely structured
• Simple, ad hoc categorization
• Active CoPs need organization
• Search tends to be hit-or-miss
Courtesy of Lillian Gassie, Naval Postgraduate School, Monterey, CA

Taxonomies in Context
A taxonomy aspires to be:
• a correlation of the different functional, regional and (possibly) national languages used by a community of practice
• a support mechanism for navigation
• a support tool for search engines and knowledge maps
• an authority for tagging documents and other information objects
• a knowledge base in its own right
Reference: “Taxonomies: the vital tool of information architecture”, www.tfpl.com

Value of Taxonomies
• Improves organization & structure
• Facilitates navigation
• Facilitates knowledge discovery
• Reduces effort
• Saves time
“Taxonomies are better created by professional indexers or librarians than by domain experts.”
Courtesy of Lillian Gassie, Naval Postgraduate School, Monterey, CA

[Screenshot: Naval Postgraduate School’s Homeland Security Taxonomy (1)]
[Screenshot: Naval Postgraduate School’s Homeland Security Taxonomy (2)]

[Screenshot: IBM Insight graphical view]

Applying a Taxonomy (1)
Manually
• Add terms into meta data fields
• Design navigation & site indexes with taxonomy hierarchy
Courtesy of Lillian Gassie, Naval Postgraduate School, Monterey, CA

[Screenshot: Incorporating Hierarchical Classification from a Taxonomy]
Courtesy of Lillian Gassie, Naval Postgraduate School, Monterey, CA

Applying a Taxonomy (2)
System integration
• Search & retrieval systems
• Auto-assignment of metadata
• Categorization systems
Courtesy of Lillian Gassie, Naval Postgraduate School, Monterey, CA

Applying the Taxonomy to a Digital Library
[Diagram labels: INTERNET (public); Library catalogs; Locally held documents; Public repositories; Commercial data sources; Agency data sources; Search engines; Search engine spiders; Filtered content; Meta-Search Tool; Automated categorization; Web portal]
Courtesy of Lillian Gassie, Naval Postgraduate School, Monterey, CA

5. Surrounding theories and applications

Other Vocabulary types
• Uncontrolled lists
• Classification System
• Subject headings
• Controlled vocabulary – usually synonyms and spelling
• Authority files
• Thesaurus
• Taxonomy

Uncontrolled list - defined
• Add terms as they occur
• No cross reference
• Simple flat structure

Controlled term lists - defined
• State the preferred terms
• Provide allowed term entry
• Heavily cross referenced
• Not generally hierarchical
• Popular
• Easy to create

Controlled term list - format
• Cars – use Automobiles
• Personal Computer – use Microcomputer
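
Illustration (not part of the original slides): the "use" references in the controlled term list format above can be loaded into a lookup so that any entry term resolves to its preferred term. This is a minimal sketch. The function names and the exact " – use " separator (an en dash with spaces, as printed on the slide) are assumptions for the example.

```python
from typing import Dict, Iterable

SEPARATOR = " – use "  # en dash, as the slide prints it (assumed to be consistent)


def parse_use_references(lines: Iterable[str]) -> Dict[str, str]:
    """Build a lookup from lower-cased entry terms to preferred terms."""
    lookup: Dict[str, str] = {}
    for line in lines:
        entry, sep, preferred = line.partition(SEPARATOR)
        if not sep:
            continue  # not a "use" reference; skip the line
        lookup[entry.strip().lower()] = preferred.strip()
    return lookup


def preferred_term(term: str, lookup: Dict[str, str]) -> str:
    """Resolve an entry term; terms without a reference are returned unchanged."""
    cleaned = term.strip()
    return lookup.get(cleaned.lower(), cleaned)


lookup = parse_use_references([
    "Cars – use Automobiles",
    "Personal Computer – use Microcomputer",
])
```

With this lookup, `preferred_term("cars", lookup)` resolves to "Automobiles", while a term that is already preferred passes through unchanged.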
Classification vs Subject Headings
• Classification
  – single spot or placement
  – browse physical list
  – often a numbering system
  – clear hierarchy
  – no or few cross references

Classification vs Subject Headings
• Subject headings
  – generic search
  – hidden classification system
  – related terms and cross references in heavy use
  – Usually the inverted form
    • cells, electric
  – Alphabetic access

Authority systems - defined
• Lists of terms in the preferred format for use
• Frequently have cross references
• Widely available
• Frequently coded lists
• Brand names

Authority lists - examples
• ISO Country Name and Code
  – International Organization for Standardization
• ISO Language list
• NAICS (SIC)
  – Standard Industrial Classification Code (SIC)
  – Replaced by
  – North American Industry Classification System (NAICS)

What is a thesaurus?
• Jessica L. Milstead. All Rights Reserved
• “For writers, it is a tool like Roget’s, one with words grouped and classified to help select the best word to convey a specific nuance of meaning.
• For indexers and searchers, it is an information storage and retrieval tool: a listing of words and phrases authorized for use in an indexing system, together with relationships, variants and synonyms, and aids to navigation through the thesaurus”
• www.jelem.com

Thesaurus - defined
• For information retrieval (1960s)
  – indexing, either intellectual or automatic
  – in searching
  – searching but not indexing
  – indexing but not searching
  – hierarchical view for searching
Thesaurus - defined
• Monolingual - standard
  – British English - ISO 2788
  – American English - ANSI/NISO Z39.19
• Multilingual – standard ISO 5964
  – concept mapping
  – Eurovoc
• Discipline or Mission based - ad hoc

Thesaurus - standard format
• Main Entries
• Top Terms - TT
• Broader Terms - BT
• Narrower Terms - NT
• Related Terms - RT
• Scope Notes - SN
• History - HI
• Date term added/changed - DA

Standards
• Monolingual
  – NISO / ANSI Z39.19
  – ISO 2788
• Multilingual
  – ISO 5964

ISO Standards
• Set up already - easy to adopt
• Multiple broader terms
• The standards outline procedures
  – ISO: better for implementation
  – NISO: much better reading

Why do we index?
• Improve precision – define scope of terms
• Improve recall – different terms for same concept
• Guide to a field of expertise
• Learning tool
• Richer expression

Uses?
• Indexing*
  – …process by which subject terms or classification symbols are assigned to concepts in documents
  – A thesaurus is also known as an indexing language
  – * not the building of the inverted file in the computer sense of indexing

What are we controlling?
• Synonyms – different terms, same concept
• Polysemes or Homonyms – same word, different meanings
  – Lead
  – Reading

How?
• Meaning – delineation of scope of a term
• Term equivalence – linking of synonyms
• Disambiguation of homonyms
  – lead (metal)
  – lead (element)
  – lead (management)

Precision options
• Language specificity
• Coordination
• Compound terms - level of precoordination
• Homographs and scope notes
• Word distance indication
Precision options
• Structural relationships
• Links and roles
• Treatment and aspect codes
• Weighting

Disambiguation
[Diagram: Bill – Invoice; Bill – Legislative; Bill – Sport; Bill – Person]

Disambiguation
[Diagram labels: Bills; PT Invoices; NT Bills; BT Legislation; RT Bill; RT Animal; NT Bill; BT Person]

6. How to build and maintain a taxonomy

How to build a taxonomy
• Collect the terms
• Pull out authority terms
• Organize into arrays
• Choose top terms
• Organize hierarchically
• Flesh out term records
• Test, review, and edit

Or said another way …
• Define scope
• Collect terms and relationships
• Identify existing taxonomies
• Identify resources
• Create & refine taxonomy
• Apply taxonomy
• Review and update

Maintain
• Steady stream of terms
  – Web logs
  – Null sets
  – New announcements
  – Indexing team
  – Library
  – Records managers
  – Etc.
• Candidate terms
• An out-of-date taxonomy is nearly useless

Best Results Measures
• Accuracy
• Productivity
• Hits, Misses and Noise
• Precision (Recall)
• Relevance
• Ease of set up
• Time to production

Integration
• Thesaurus
  – full featured
  – multiple views
  – multiple versions
  – multiple languages
• Automatic indexing
  – filtering
  – assisted
• Data Harmony MAI and Thesaurus Master

Visual Taxonomy
• Ways to look
  – Hierarchical
  – Alphabetic
  – by term
  – Ring diagrams
  – Topic maps
  – Related terms

[Diagram: API to Many Systems for CMS]

Apply to the meta data
• Automatic application?
• Spider setting internally
• External web crawls – use all aliases
• Filter data
• Enhance search experience

Meta data
• The fields
• The elements
  – Class codes
  – Title
  – Author
  – Plaintiff
  – Product
  – subject / topic
• Meta Name Keywords in HTML

7. How Taxonomies are used in Enterprise Information

[Screenshot: Brand is repeated in several spots and tied to search as well]

[Screenshot: Another way of listing brands]

[Screenshot: Category list from taxonomy is tied to brand list and product list]

[Screenshot: Category code from the taxonomy is tied to the brand list and the product list]

Enterprise Taxonomy Management
• Consistent application across entire site
• Synonyms are used interchangeably
• User doesn’t need to know the taxonomy
• Pop up view is helpful
• Site map for construction and browsing
• Allows hidden sections for internal use

Taxonomies
• Form the basis for knowledge sharing
• Add value to discussion
• Allow deeper retrieval
• Are straightforward to create
• Require on-going maintenance

Your Taxonomy
• There is too much information to pile it on the floor.
• It fits in many places in the information flow

Implementing a Taxonomy in a Content Management Portal
[Diagram labels: Thesaurus Master; Database Management System; Add Metadata using MAI; Data Feed; MAI to add Metadata; Inverted File]

Thank you for your time!
Questions?

Marjorie M.K. Hlava
Access Innovations, Inc.
505-998-0800
mhlava@accessinn.com
www.accessinn.com