Taxonomy Strategies LLC
Workshop: Why and How to Use Dublin Core for Enterprise-Wide Metadata Applications
Ron Daniel & Joseph Busch, Taxonomy Strategies
May 22, 2005
Copyright 2005 Taxonomy Strategies LLC. All rights reserved.

Workshop goals
1. What is the Dublin Core?
2. Answer these enterprise-wide metadata ROI questions:
   • What is the value proposition for adding metadata to content? Does metadata make content reusable? Findable? Improve productivity?
   • How can metadata value be measured in a way that quantifies how it contributes to the bottom line?
3. Answer these business process questions:
   • How is Dublin Core tagging being done on content to expose metadata to portals, search engines, and other metadata-aware applications?
   • How are metadata value spaces (controlled vocabularies) maintained within an enterprise? Across enterprises?
4. Answer these technology questions:
   • What tools exist to use Dublin Core and other metadata standards in enterprise information management environments?

TAXONOMY STRATEGIES LLC The business of organized information

Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Q&A
6:45 Adjourn

Who we are: Joseph Busch
Over 25 years in the business of organized information
• Founder, Taxonomy Strategies
• Director, Solutions Architecture, Interwoven
• VP, Infoware, Metacode Technologies (acquired by Interwoven, November 2000)
• Program Manager, Getty Foundation
• Manager, Pricewaterhouse
Metadata and taxonomies community leadership
• President, American Society for Information Science & Technology
• Director, Dublin Core Metadata Initiative
• Adviser, National Research Council Computer Science and Telecommunications Board
• Reviewer, National Science Foundation Division of Information and Intelligent Systems
• Founder, Networked Knowledge Organization Systems/Services

Who we are: Ron Daniel, Jr.
Over 15 years in the business of metadata & automatic classification
• Principal, Taxonomy Strategies
• Standards Architect, Interwoven
• Senior Information Scientist, Metacode Technologies (acquired by Interwoven, November 2000)
• Technical Staff Member, Los Alamos National Laboratory
Metadata and taxonomies community leadership
• Chair, PRISM (Publishers Requirements for Industry Standard Metadata) working group
• Acting chair: XML Linking working group
• Member: RDF working groups
• Co-editor: PRISM, XPointer, 3 IETF RFCs, and Dublin Core 1 & 2 reports.

Recent & current projects
Government: Commodity Futures Trading Commission; Defense Intelligence Agency; ERIC; Federal Aviation Administration; Federal Reserve Bank of Atlanta; Forest Service; GSA Office of Citizen Services (www.firstgov.gov); Head Start; Infocomm Development Authority of Singapore; NASA (nasataxonomy.jpl.nasa.gov); Small Business Administration; Social Security Administration; USDA Economic Research Service; USDA e-Government Program (www.usda.gov)
Commercial: Allstate Insurance; Blue Shield of California; Debevoise & Plimpton; Halliburton; Hewlett Packard; Motorola; PeopleSoft; Pricewaterhouse Coopers; Siderean Software; Sprint; Time Inc.
Commercial subcontracts: Agency.com – Top financial services; Critical Mass – Fortune 50 retailer; Deloitte Consulting – Big credit card; Gistics/OTB – Direct selling giant
NGOs: CEN; IDEAlliance; IMF; OCLC

What we do
Organize Stuff

Who are you?
Tell us:
• Your name
• Your organization
• Your job title
• The things you want to get from this workshop

Metadata: Different definitions
Library & Information Science
• Author/Title/Subject
• Controlled Vocabularies for Subject Codes (e.g. Dewey)
• Authority Files for Author Names
Database
• Tables/Columns/Datatypes/Relationships
• References for some values

Metadata: Why it matters
“Adding metadata to unstructured content allows it to be managed like structured content. Applications that use structured content work better.”
“Enriching content with structured metadata is critical for supporting search and personalized content delivery.”
“Content that has been adequately tagged with metadata can be leveraged in usage tracking, personalization and improved searching.”
“Better structure equals better access: Taxonomy serves as a framework for organizing the ever-growing and changing information within a company. The many dimensions of taxonomy can greatly facilitate Web site design, content management, and search engineering. If well done, taxonomy will allow for structured Web content, leading to improved information access.”

Metadata: Supports core functions
[Figure: metadata types plotted by Complexity against Enabled Functionality. Groups shown: Asset metadata (Who); Subject metadata (What, Where & Why); Use metadata (When & How); Relational metadata (Links between and to). Elements shown: Subject, Title, Description, Date, Language, Rights, Coverage, Creator, Publisher, Contributor, Type, Format, Source, Relation, Identifier. Benefits shown: better navigation & discovery; more efficient editorial process.]
http://dublincore.org/documents/dces/

What is a taxonomy?
Systematics view: Hierarchical classification of things into a tree structure
Linnaeus
• Kingdom: Animalia
• Phylum: Chordata
• Class: Mammalia
• Order: Carnivora
• Family: Canidae
• Genus: Canis
• Species: C. familiaris
UNSPSC
• Segment: 44 – Office Equipment and Accessories and Supplies
• Family: .12 – Office Supplies
• Class: .17 – Writing Instruments
• Commodity: .05 – Mechanical pencils; .06 – Wooden pencils; .07 – Colored pencils

Dublin Core: A little more complicated
Elements
1. Identifier
2. Title
3. Creator
4. Contributor
5. Publisher
6. Subject
7. Description
8. Coverage
9. Format
10. Type
11. Date
12. Relation
13. Source
14. Rights
15. Language
Refinements: Abstract; Access rights; Alternative; Audience; Available; Bibliographic citation; Conforms to; Created; Date accepted; Date copyrighted; Date submitted; Education level; Extent; Has format; Has part; Has version; Is format of; Is part of; Is referenced by; Is replaced by; Is required by; Issued; Is version of; License; Mediator; Medium; Modified; Provenance; References; Replaces; Requires; Rights holder; Spatial; Table of contents; Temporal; Valid
Encodings: Box; DCMIType; DDC; IMT; ISO3166; ISO639-2; LCC; LCSH; MESH; Period; Point; RFC1766; RFC3066; TGN; UDC; URI; W3CTDF
Types: Collection; Dataset; Event; Image; Interactive Resource; Moving Image; Physical Object; Service; Software; Sound; Still Image; Text

Dublin Core framework for corporate use
• Not just 15 elements
• A framework to enable cross-resource exploration and use
• Dublin Core is the framework for “integration metadata” at BellSouth
Source: Todd Stephens, BellSouth

Metadata: A data specification – a recipe example
Element | Data Type | Length | Req. / Repeat | Source | Purpose
Asset Metadata
Unique ID (dc:identifier) | Integer | Fixed | 1 | System supplied | Basic accountability
Recipe Title (dc:title) | String | Variable | 1 | Licensed Content | Text search & results display
Recipe summary (dc:description) | String | Variable | 1 | Licensed Content | Content
Main Ingredients X | List | Variable | ? | Main Ingredients vocabulary | Key index to retrieve & aggregate recipes, & generate shopping list
Subject Metadata (purpose: Browse or group recipes & filter search results)
Meal Types X | List | Variable | * | Meal Types vocab
Cuisines X | List | Variable | * | Cuisines
Courses X | List | Variable | * | Courses vocab
Cooking Method X | Flag | Fixed | * | Cooking vocab
Link Metadata
Recipe Image (dcterms:hasPart) | Pointer | Variable | ? | Product Group | Merchandize products
Use Metadata
Rating | String | Variable | 1 | Licensed Content | Filter, rank, & evaluate recipes
Release Date (dc:date) | Date | Fixed | 1 | Product Group | Publish & feature new recipes
Constant values: dc:type=“recipe”, dc:format=“text/html”, dc:language=“en”
Legend: ? – 1 or more; * – 0 or more

Why Dublin Core?
• Dublin Core is a de-facto standard across many other systems and standards
  • RSS (1.0), OAI
  • Inside organizations – portals, CMS, …
• Mapping to DC elements from most existing schemes is simple
  • Beware of force-fits
• Why will metadata already exist?
  • Because of search projects, portal integration projects, etc. that are creating it or standardizing a mapping.
[Figure: layered diagram with the labels “Taxonomies, Vocabularies, Ontologies”, “Dublin Core and Similar”, and “Per-Source Data Types, Access Controls, etc.” Source: Todd Stephens, BellSouth]
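
The recipe specification above pairs a handful of local fields with Dublin Core elements and holds three values constant across the collection. A minimal sketch of that kind of mapping in Python follows. The local field names (`unique_id`, `recipe_title`, and so on), the `to_dublin_core` function, the `local:` prefix, and the sample record are all illustrative assumptions, not part of the workshop material. Fields with no Dublin Core equivalent are kept under the local prefix rather than force-fit into an element.

```python
# Hypothetical mapping from local recipe fields to Dublin Core elements,
# using the element names shown in the recipe specification.
FIELD_TO_DC = {
    "unique_id": "dc:identifier",
    "recipe_title": "dc:title",
    "recipe_summary": "dc:description",
    "release_date": "dc:date",
    "recipe_image": "dcterms:hasPart",
}

# Values that are constant across the whole recipe collection.
COLLECTION_CONSTANTS = {
    "dc:type": "recipe",
    "dc:format": "text/html",
    "dc:language": "en",
}


def to_dublin_core(record):
    """Return a dict of Dublin Core (and local) metadata for one record."""
    metadata = dict(COLLECTION_CONSTANTS)
    for field, value in record.items():
        # Unmapped fields keep a local prefix instead of being force-fit.
        element = FIELD_TO_DC.get(field, "local:" + field)
        metadata[element] = value
    return metadata


# Sample record; the values are made up for illustration.
example = {
    "unique_id": 1001,
    "recipe_title": "Tomato Soup",
    "release_date": "2005-05-22",
    "cuisines": ["Italian"],
}
dc_record = to_dublin_core(example)
```

Keeping the mapping in a plain table makes it easy to review against the specification and to extend when another source system is brought in.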

Creator
“An entity primarily responsible for making the content of the resource”
In other words – Author, Photographer, Illustrator, …
• Potential refinements by creative role: rarely justified
• Creators can be persons or organizations
• Key Point – Name variations are a big issue in data quality: Ron Daniel; Ron Daniel, Jr.; Ron Daniel Jr.; R.E. Daniel; Ronald Daniel; Ronald Ellison Daniel, Jr.; Daniel, R.
• Reminder: Name fields may contain other information
  <dc:creator>Case, W. R. (NASA Goddard Space Flight Center, Greenbelt, MD, United States)</dc:creator>
• Best practice – Validate names against LDAP or other “Authority File”
Refinements: None
Encodings: None

Example – Name mismatches
One of these things is not like the other:
• Ron Daniel, Jr. and Carl Lagoze; “Distributed Active Relationships in the Warwick Framework”
• Hojung Cha and Ron Daniel; “Simulated Behavior of Large Scale SCI Rings and Tori”
• Ron Daniel; “High Performance Haptic and Teleoperative Interfaces”
Differences may not matter. If they do:
• This error cannot be reliably detected automatically
• Authority files and an error-correction procedure are needed

Contributor
“An entity responsible for making contributions to the content of the resource.”
In practice – rarely used. Difficult to distinguish from Creator.
Best Practice? Adds UI complexity for no real gain.
Recommendation – Don’t use.
Refinements: None
Encodings: None

Publisher
“An entity responsible for making the resource available”.
Problems:
• All the name-handling stuff of Creator.
• Hierarchy of publishers (Bureau, Agency, Department, …)
Refinements: None
Encodings: None

Title
“A name given to the resource”.
Issues:
• Hierarchical Titles, e.g. Conceptual Structures: Information Processing in Mind and Machine (The Systems Programming Series)
• Untitled Works
• Metaphysics
Refinements: Alternative
Encodings: None

Identifier
“An unambiguous reference to the resource within a given context”
Best Practice: URL
Future Best Practice: URI?
Problems:
• Metaphysics
• Personalized URLs
• Multiple identifiers for same content
• Non-standard resolution mechanisms for URIs
Recommendations – Plan how to introduce long-lived URLs
Refinements: Bibliographic Citation
Encodings: URI

Date
“A date associated with an event in the life cycle of the resource”
Woefully underspecified. Typically the publication or last modification date.
Best practice: YYYY-MM-DD
Refinements: Created; Valid; Available; Issued; Modified; Date Accepted; Date Copyrighted; Date Submitted
Encodings: DCMI Period; W3C DTF (Profile of ISO 8601)

Subject
“The topic of the content of the resource.”
Best practice: Use pre-defined subject schemes, not user-selected keywords.
Supported encodings are probably not useful for most corporate needs.
Factor “Subject” into separate facets:
• People, places, organizations, events, objects, services
• Industry sectors
• Content types, audiences, functions
• Topic
Some of the facets are already defined in DC (Coverage, Type) or DCTERMS (Audience)
Refinements: None
Encodings: DDC; LCC; LCSH; MESH; UDC

Coverage
“The extent or scope of the content of the resource”.
In other words – places and times as topics.
Key Point – Locations are important in SOME environments, irrelevant in others. Time periods as subjects are rarely important in commercial work.
Best Practice – ISO 3166-1, 3166-2
Refinements: Spatial; Temporal
Encodings: Box (for Spatial); ISO3166 (for Spatial); Point (for Spatial); TGN (for Spatial); W3CTDF (for Temporal)

Description
“An account of the content of the resource”.
In other words – an abstract or summary
Key Point – What’s the cost/benefit tradeoff for creating descriptions?
• Quality of auto-generated descriptions is low
• For search results, hit highlighting is probably better
Refinements: Abstract; Table of Contents
Encodings: None

Type
“The nature or genre of the content of the resource”
Best Current Practice: Create a custom list of content types, and use that list for the values.
• Try to avoid “image”, “audio”, and other format names in the list of content types; they can be derived from “Format”.
• No broadly-acceptable list yet found.
Refinements: None
Encodings: DCMI Type

Format
“The physical or digital manifestation of the resource.”
In other words – the file format
Best practice: Internet Media Types
Outliers: File sizes, dimensions of physical objects
Refinements: Extent; Medium
Encodings: IMT

Language
“A language of the intellectual content of the resource”.
Best Practice: ISO 639, RFC 3066
Dialect codes: Advanced practice
Refinements: None
Encodings: ISO639-2; RFC1766; RFC3066

Relation
“A reference to a related resource”
Very weak meaning – not even as strong as “See also”.
Best practice: Use a refinement element and URLs.
Refinements: Is Version Of; Has Version; Is Replaced By; Replaces; Is Required By; Requires; Is Part Of; Has Part; Is Referenced By; References; Is Format Of; Has Format; Conforms To
Encodings: URI

Source
“A reference to a resource from which the present resource is derived”
Original intent was for derivative works
Frequently abused to provide bibliographic information for items extracted from a larger work, such as articles from a Journal
Refinements: None
Encodings: URI

Rights
“Information about rights held in and over the resource”
Could be a copyright statement, or a list of groups with access rights, or …
Refinements: Access Rights; License
Encodings: None

CEN/ISSS Workshop on Dublin Core
Guidance information for the deployment of Dublin Core metadata in Corporate Environments
http://www.cenorm.be/cenorm/businessdomains/businessdomains/isss/cwa/cwa15247.asp

Dublin Core: CEN/ISSS Workshop on Dublin Core Metadata – corporate uses
Applied Information Technique; AstraZeneca; BBC; BellSouth; Cisco; Daimler Chrysler; Giunti Labs; GSK; Halliburton; HP; IBM; Intel; John Wiley & Sons; Lilly; PeopleSoft; Rohm Haas; SAP; Software AG; Unisys

How is Dublin Core used in corporate environments?
[Chart: bar chart on a 0% to 60% scale. Bars: De facto; Simple; Access enabler; Compliance. Values shown: 57%, 43%, 43%, 29%. Base: 20 corporate information managers.]
Source: CEN/ISSS Workshop on Dublin Core – Guidance information for the deployment of Dublin Core metadata in Corporate Environments

Taxonomy: e-Forms example
Facets and their controlled vocabularies:
• Agency: 0001 Legislative; 1000 Judicial; 1100 Executive Office of Pres; 0003 Exec Depts; 1200 Agriculture; 1300 Commerce; 9700 Defense; 9100 Education; 8900 Energy; 7500 HHS; 7000 DHS; 8600 HUD; 1400 Interior; 1500 Justice; 1600 Labor; 1900 State; 6900 Transport; 2000 Treasury; 3600 Veterans; Ind Agencies; Intl Orgs
• Form Type: Application; Approval; Claim; Information request; Information submission; Instructions; Legal filing; Payment; Procurement; Renewal; Reservation; Service request; Test; Other input; Other transaction
• Industry Impact: 00 Generic; 11 Agriculture; 21 Mining; 22 Utilities; 23 Construct; 31-33 Manuf; 42 Wholesale; 44-45 Retail; 48-49 Trans; 51 Info; 52 Finance; 54 Profession; 55 Mgmt; 56 Support; 61 Education; 62 Health Care; 71 Arts; 72 Hospitality; 81 Other Services; 92 Public Admin
• Jurisdiction: Federal; State +; Local +; Other +
• BRM Impact: Citizen Srvcs (Social Srvs; Defense; Disasters; Econ Dev; Education; Energy; Env Mgmt; Law Enf; Judicial; Correctional; Health; Security; Income Sec; Intelligence; Intl Affairs; Nat Resour; Transport; Workforce; Science); Delivery; Support; Management
• Keyword Topic: Agriculture & food; Commerce; Communications; Education; Energy; Env pro; Foreign rels; Govt; Health & safety; Housing & comm dev; Labor; Law; Named grps; National def; Nat resources; Recreation; Sci & tech; Social pgms; Transport
• Audience: All; General; Citizen; Business; Govt; Employee; Native American; Nonresident; Tourist; Special group

How is Dublin Core extended?
[Chart: bar chart on a 0% to 120% scale. Bars: Doc Types; Products & Services; Roles; Inconsistent Encoding. Values shown: 100%, 86%, 57%, 57%. Base: 20 corporate information managers.]
Source: CEN/ISSS Workshop on Dublin Core – Guidance information for the deployment of Dublin Core metadata in Corporate Environments

Custom business process document types? Ouch!
Oil & gas services company document types:
• analysis, appraisals, assessments, forecasts, predictions
• agendas, plans, designs, schedules, workflow
• applications, proposals, requests, requirements
• permits, consents, approvals, rejections, certificates
• work orders, correspondence
• auditing, compliance, testing, inspections, operations reports
• lessons learned, after-action reviews, meeting minutes, FAQs
• policies, procedures, training manuals, standards, best practices
• research notes, journal articles
• newsletters, bulletins, press releases
• ads, brochures, data sheets, technical notes, case studies, price lists
• checklists, templates, forms, logos, branding
• software, database forms

The power of taxonomy facets
• 4 independent categories of 10 nodes each have the same discriminatory power as one hierarchy of 10,000 nodes (10^4)
• Easier to maintain
• Can be easier to navigate

Taxonomic metadata example: Form SS-4.
Employer Identification Number (EIN)
Facet | Values
Agency | IRS
Content Type | Information Submission
Industry Impact | Generic
Jurisdiction | Federal
Programs & Services | Support Delivery of Services/General Government/Taxation Management
Keyword Topic | Commerce/Employment taxes
Audience | Business

Fundamentals of metadata ROI
• Tagging content using metadata and a taxonomy are costs, not benefits.
• There is no benefit without exposing the tagged content to users in some way that cuts costs or improves revenues.
• Putting metadata and a taxonomy into operation requires UI changes and/or backend system changes, as well as data changes.
• You need to determine those changes, and their costs, as part of the ROI.

Common metadata ROI scenarios
• Catalog site: Increased sales. Increased productivity.
• Customer support: Cutting costs. Increased sales.
• Compliance: Avoiding penalties.
• Knowledge worker productivity: Less time searching, more time working.
• Executive Mandate: No ROI study, just someone with a vision and a budget.

Metadata ROI: Catalog site
• Guided Navigation
• 2-3 clicks to product
• No dead ends
http://www.tesco.com/winestore

Metadata ROI: Catalog site
Enterprise portal cost: $6M
Increased sales
• Product findability.
• Product cross-sells and up-sells.
• Customer loyalty.
1-5% increase in sales
• $57.6B sales (’04): $600M to $2B/year
• $2.1B net income (’04): $21M to $105M/year
1-5% increase in productivity
• $50K average cost per employee, 310,400 employees (’04): $155M to $776M/year

Metadata ROI: Customer support model
• Help on search page, not a click away.
• Type and go to search for specific policies
• Policy categories for browsing
• Refine search offered with results
• Good search results for policy topics, e.g., “pets”

Metadata ROI: Customer support model
Manual processing
• 100,000 documents, 2 pages per document, $4 per page: $800K
Self service
• Fewer customer calls.
• Faster, more accurate CSR responses through better information access.
25-50% service efficiency increase
• 300K customer service calls per month, $6 cost per call: $5.4M to $10.8M/yr
1-5% increased sales
• $18.6B sales (’04): $186M to $930M/year
• ($761M) net income (’04): ($575M) to $169M/year

Metadata ROI: Compliance
Avoiding penalties for breaching regulations
• SOX: up to 5 years in jail
• SOX: up to $5M
Following required procedures
• Loss of company: $100B revenue (’00), $100B
• Loss of partner companies: Arthur Andersen

Knowledge workers spend up to 2.5 hours each day looking for information …
[Chart labels: Communicating; Searching; Creating]
… But find what they are looking for only 40% of the time.
– Kit Sims Taylor

High cost of not finding information
“The amount of time wasted in futile searching for vital information is enormous, leading to staggering costs …”
– Sue Feldman
High cost of poor classification
Poor classification costs a 10,000 user organization $10M each year, about $1,000 per employee.
– Jakob Nielsen, useit.com
But “better search” itself is a weak ROI

Knowledge workers spend more time re-creating existing content than creating new content
[Chart labels: Communicating; Recreating existing content 26%; Searching; Creating new content 9%]
– Kit Sims Taylor

Metadata ROI: Productivity
Enterprise document management system cost: $10M
Decreased cost to market
• Decreased development cost
• Increased R&D productivity
• Reduced time for sales & marketing
1-5% decrease in drug development cost
• $800M/drug: $8M to $16M/drug
5-10% increase in R&D productivity
• 13% of revenue, $39B in sales (’04): $254M to $507M/year
10-20% decrease in time for sales & marketing
• 13% of revenue: $254M to $507M/year

Metadata FAQ: Executive mandate is key
• There is no ROI out of the box
• Just someone with a vision … and the budget to make it happen.
• What’s really needed? Demos and proofs of value, so that a stronger cost-benefit argument can be made for continuing the work.

Metadata FAQ: How do you sell it?
• Don’t sell “metadata” or “taxonomy”, sell the vision of what you want to be able to do.
• Clearly understand what the problem is and what the opportunities are.
• Do the calculus (costs and benefits)
• Design the taxonomy (in terms of LOE) in relation to the value at hand.

Overview of metadata practices
• Identify the team
• Use (or map to) Dublin Core for basic information.
• Extend with custom elements for specific facts.
• Use pre-existing, standard vocabularies as much as possible.
  • ISO country codes for locations
  • Product & service info from ERP system
  • Validate author names with LDAP directory
• Design a QC Process
  • Start with an error-correction process, then get more formal on error detection
  • Large-scale ontologies may be valuable in automated error detection

Factor “Subject” into smaller facets
• Size: DMOZ tries to organize all web content, and has more than 600k categories!
• Difficulty in navigating, maintaining
• Hidden facet structure
• “Classification Schemes” vs. “Taxonomies”

Sources for 7 common vocabularies
Vocabulary | DC element | Definition | Potential Sources
Organization | dc:publisher | Organizational structure. | FIPS 95-2, U.S. Government Manual, Your organizational structure, etc.
Content Type | dc:type | Structured list of the various types of content being managed or used. | DC Types, AGLS Document Type, AAT Information Forms, Records management policy, etc.
Industry | | Broad market categories such as lines of business, life events, or industry codes. | FIPS 66, SIC, NAICS, etc.
Location | dc:coverage | Place of operations or constituencies. | FIPS 5-2, FIPS 55-3, ISO 3166, UN Statistics Div, US Postal Service, etc.
Function | | Functions and processes performed to accomplish mission and goals. | FEA Business Reference Model, Enterprise Ontology, AAT Functions, etc.
Topic | dc:subject | Business topics relevant to your mission and goals. | Federal Register Thesaurus, NAL Agricultural Thesaurus, LCSH, etc.
Audience | dcterms:audience | Subset of constituents to whom a piece of content is directed or intended to be used. | GEM, ERIC Thesaurus, IEEE LOM, etc.
Products and Services | | Names of products/programs & services. | ERP system, Your products and services, etc.

Cheap and Easy Metadata
Some fields will be constant across a collection. In the context of a single collection those kinds of elements add no value, but they add tremendous value when many collections are brought together into one place, and they are cheap to create and validate.

Taxonomy Business Processes
• Taxonomies must change, gradually, over time if they are to remain relevant
• Maintenance processes need to be specified so that the changes are based on rational cost/benefit decisions
• A team will need to maintain the taxonomy on a part-time basis
• Taxonomy team reports to some other steering committee

Definitions about the Controlled Vocabulary Governance Environment
1: Syndicated Terminologies change on their own schedule
2: CV Team decides when to update CVs
3: Team adds value via mappings, translations, synonyms, training materials, etc.
4: Updated versions of CVs published to consuming applications
[Figure: Controlled Vocabulary Governance Environment. Labels shown: Syndicated Terminologies; ISO 3166-1; Other External; Change Requests & Responses; Published CVs and STs; Vocabulary Management System; Notifications; CVs; Custodians; Other Controlled Items; Consuming Applications; Web CMS; Archives; Intranet Search; ERMS; ERP; Other Internal; Intranet Nav.; DAM; …]

Other Controlled Items
Taxonomy Team will have additional items to manage:
• Charter, Goals, Performance Measures
• Editorial rules
• Team processes
• Tagger training materials (manual and automatic)
• Outreach & ROI
  • Communication plan
  • Website
  • Presentations
  • Announcements
• Roadmap

Taxonomy governance | Generic team charter
Taxonomy Team is responsible for maintaining:
• The Taxonomy, a multi-faceted classification scheme
• Associated taxonomy materials, such as:
  • Editorial Style Guide
  • Taxonomy Training Materials
  • Metadata Standard
• Team rules and procedures (subject to CIO review)
  • Team evaluates costs and benefits of suggested change
Taxonomy Team will:
• Manage relationship between providers of source vocabularies and consumers of the Taxonomy
• Identify new opportunities for use of the Taxonomy across the Enterprise to improve information management practices
• Promote awareness and use of the Taxonomy

Other Controlled Items - Editorial Rules
To ensure consistent style, rules are needed.
Issues commonly addressed in the rules:
• Sources of Terms
• Abbreviations
• Ampersands
• Capitalization
• Continuations (More… or Other…)
• Duplicate Terms
• Hierarchy and Polyhierarchy
• Languages and Character Sets
• Length Limits
• “Other” – Allowed or Forbidden?
• Plural vs. Singular Forms
• Relation Types and Limits
• Scope Notes
• Serial Comma
• Spaces
• Synonyms and Acronyms
• Term Arrangement (Alphabetic or …)
• Term Label Order (Direct vs. Inverted)
Must also address the issue of what to do when rules conflict – which are more important?
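
Several of the issues listed above (ampersands, length limits, spaces, duplicate terms) lend themselves to automated checks before a term enters the vocabulary. A minimal sketch in Python, assuming a hypothetical house style that prefers '&' to 'and', caps label length, and treats labels that differ only in case as duplicates. The function name `check_term_labels`, the 40-character limit, and the problem messages are illustrative assumptions, not rules from the workshop.

```python
import re

MAX_LABEL_LENGTH = 40  # assumed limit, for illustration only


def check_term_labels(labels):
    """Return a list of (label, problem) pairs for a list of term labels."""
    problems = []
    seen = set()
    for label in labels:
        # Assumed style rule: '&' is preferred to the word 'and'.
        if re.search(r"\band\b", label, flags=re.IGNORECASE):
            problems.append((label, "use '&' instead of 'and'"))
        if len(label) > MAX_LABEL_LENGTH:
            problems.append((label, "label too long"))
        # Leading, trailing, or doubled spaces.
        if label != label.strip() or "  " in label:
            problems.append((label, "stray spaces"))
        # Labels differing only in case or outer spaces count as duplicates.
        key = label.strip().lower()
        if key in seen:
            problems.append((label, "duplicate term"))
        seen.add(key)
    return problems


report = check_term_labels(
    ["Manuals and Forms", "Manuals & Forms", "manuals & forms"]
)
```

A check like this only flags candidates. Deciding which rule wins when two conflict remains an editorial decision for the taxonomy team.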
Rule Name | Editorial Rule
Use Existing Vocabularies | Other things being equal, reusing an existing vocabulary is preferred to creating a new one.
Ampersands | The character ‘&’ is preferred to the word ‘and’ in Term Labels. Example: Use Type: “Manuals & Forms”, not “Manuals and Forms”.
Special Characters | Retain accented characters in Term Labels. Example: España
Serial comma | If a category name includes more than two items, separate the items by commas. The last item is separated by the character ‘&’, which IS NOT preceded by a comma. Example: “Education, Learning & Employment”, not “Education, Learning, & Employment”.
Capitalization | Use title case (where all words except articles are capitalized). Example: “Education, Learning & Employment”, NOT “Education, learning & employment”, NOT “EDUCATION, LEARNING & EMPLOYMENT”, NOT “education, learning & employment”.
… | …

Roles in Two Taxonomy Governance Teams
Executive Sponsor
- Advocate for the taxonomy team
Business Lead
- Keeps team on track with larger business objectives
- Balances cost/benefit issues to decide appropriate levels of effort
- Specialists help in estimating costs
- Obtains needed resources if those in team can’t accomplish a particular task
Technical Specialist
- Estimates costs of proposed changes in terms of amount of data to be retagged, additional storage and processing burden, software changes, etc.
- Helps obtain data from various systems
Content Specialist
- Team’s liaison to content creators
- Estimates costs of proposed changes in terms of editorial process changes, additional or reduced workload, etc.
Taxonomy Specialist
- Suggests potential taxonomy changes based on analysis of query logs, indexer feedback
- Makes edits to taxonomy, installs into system with aid of IT specialist
Content Owner
- Reality check on process change suggestions
Small-scale Metadata QA Responsibility

Team structure at a different org.
- Business Lead
- Custodians: Responsible for content in a specific CV.
- Training Representative: Develops communications plan, training materials
- Work Practices Representative: Develops processes, monitors adherence
- IT Representative: Backups, admin of CV Tool
- Info. Mgmt. Representative: Provides CV expertise, tie-in with larger IM effort in the organization.

Taxonomy governance | Where changes come from
[Diagram. Components: Firewall, Application UI, Tagging UI, Content, Application Logic, Tagging Logic, Taxonomy. Actors: End User, Tagging Staff, Taxonomy Editor, Taxonomy Team.]
Sources of change:
- Staff notes ‘missing’ concepts
- Query log analysis
- Requests from other parts of the organization
Recommendations by Editor:
1. Small taxonomy changes (labels, synonyms)
2. Large taxonomy changes (retagging, application changes)
3. New “best bets” content
Team considerations:
1. Business goals
2. Changes in user experience
3. Retagging cost

Principles
- Basic facets with identified items – people, places, projects, instruments, missions, organizations, …
- Note that these are not subjective “subjects”, they are objective “objects”.
- Clearly identify the Custodians of the facets, and the process for maintaining and publishing them.
- Subjective views can be laid on top of the objective facts, but should be in a different namespace so they are clearly distinguishable.
- For example, labels like “Anarchist” or “President” can be applied to the same person at different times (e.g. Nelson Mandela).
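The separation described above, objective identified items in a facet with subjective views layered on top in a different namespace, can be sketched as a small data structure. This is a minimal Python illustration only. The namespace names ("people", "views"), the identifier, the person, and the years are all invented for the example and do not come from the workshop or from any real schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FacetItem:
    """An objective, identified item in a facet (person, place, mission, ...)."""
    namespace: str  # facet namespace, maintained by the facet's Custodian
    item_id: str    # stable identifier
    label: str


@dataclass(frozen=True)
class SubjectiveLabel:
    """A subjective view attached to an item, kept in its own namespace."""
    namespace: str  # deliberately different from the facet namespace
    label: str
    item: FacetItem
    start_year: Optional[int] = None
    end_year: Optional[int] = None

    def applies_in(self, year: int) -> bool:
        """True if the label was applied during the given year."""
        after_start = self.start_year is None or year >= self.start_year
        before_end = self.end_year is None or year <= self.end_year
        return after_start and before_end


def labels_for(item: FacetItem, labels: list[SubjectiveLabel], year: int) -> list[str]:
    """Subjective labels applied to one objective item in a given year."""
    return sorted(l.label for l in labels if l.item == item and l.applies_in(year))


# Invented example data: one person, two labels at different times.
person = FacetItem(namespace="people", item_id="p-001", label="Example Person")
views = [
    SubjectiveLabel("views", "Anarchist", person, 1960, 1970),
    SubjectiveLabel("views", "President", person, 1990, 1995),
]
```

Because the item's identity never changes, content tagged with `p-001` stays findable whatever labels come and go. The labels can be revised, disputed, or dropped without retagging.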
Enterprise Portal challenges when organizing content
- Multiple subject domains across the enterprise
- Vocabularies vary
- Granularity varies
- Unstructured information represents about 80%
- Information is stored in complex ways
- Multiple physical locations
- Many different formats
- Tagging is time-consuming and requires SME involvement
- Portal doesn’t solve content access problem
- Knowledge is power syndrome
- Incentives to share knowledge don’t exist
- Free flow of information TO the portal might be inhibited
- Content silo mentality changes slowly
- What content has changed? What exists? What has been discontinued?
- Lack of awareness of other initiatives

Challenges when organizing content on enterprise portals
- Lack of content standardization and consistency
- Content messages vary among departments
- How do users know which message is correct?
- Re-usability low to non-existent
- Costs of content creation, management and delivery may not change when portal is implemented: similar subjects, BUT diverse media, diverse tools, different users
- How will personalization be implemented?
- How will existing site taxonomies be leveraged?
- Taxonomy creation may surface “holes” in content

Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Q&A
6:45 Adjourn

Methods used to create & maintain metadata
[Bar chart, y-axis 0% to 80%. Categories: Forms, Distributed Production, Centralized production, Not Automated. Bar values shown: 71%, 57%, 43%, 43%.]
Base: 20 corporate information managers
CEN/ISSS Workshop on Dublin Core – Guidance information for the deployment of Dublin Core metadata in Corporate Environments

The Tagging Problem
- How are we going to populate metadata elements with complete and consistent values?
- What can we expect to get from automatic classifiers?

Tagging
- Province of authors (SMEs) or editors?
- Taxonomy often highly granular to meet task and re-use needs.
- Vocabulary dependent on originating department.
- The more tags there are (and the more values for each tag), the more hooks to the content.
- If there are too many, authors will resist and use “general” tags (if available).
- Automatic classification tools exist, and are valuable, but results are not as good as humans can do.
- “Semi-automated” is best.
- Degree of human involvement is a cost/benefit tradeoff.
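The “semi-automated” approach above can be sketched as a routing rule: classifier suggestions are accepted automatically only above a confidence threshold, and everything else goes to a human reviewer. This minimal Python sketch assumes a hypothetical classifier that returns (tag, confidence) pairs. The 0.8 threshold and the sample output are made up for illustration and would need tuning against real review costs.

```python
from typing import Iterable

REVIEW_THRESHOLD = 0.8  # hypothetical cut-off; higher means more human review


def route_suggestions(
    suggestions: Iterable[tuple[str, float]],
    threshold: float = REVIEW_THRESHOLD,
) -> tuple[list[str], list[str]]:
    """Split classifier suggestions into auto-accepted tags and tags needing review."""
    accepted, needs_review = [], []
    for tag, confidence in suggestions:
        if confidence >= threshold:
            accepted.append(tag)      # trusted enough to apply without review
        else:
            needs_review.append(tag)  # queued for an author or editor to confirm
    return accepted, needs_review


# Made-up classifier output for one document.
suggested = [("Manuals & Forms", 0.93), ("Training", 0.55), ("Policies", 0.80)]
auto, manual = route_suggestions(suggested)
```

Moving the threshold is one way to express the cost/benefit tradeoff. A threshold of 1.0 sends nearly everything to people, and a threshold of 0.0 accepts every suggestion unreviewed.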
Automatic categorization vendors | Analyst viewpoint
[Chart. Axes: Content Volumes (low to high) and Accuracy Level (low to high).]

Considerations in automatic classifier performance
[Chart. Axes: Accuracy and Development Effort / Licensing Expense. Labels: Regexps, Trained Librarians, potential performance gain.]
- Classification performance is measured by “inter-cataloger agreement”.
- Trained librarians agree less than 80% of the time.
- Errors are subtle differences in judgment, or big goofs.
- Automatic classification struggles to match human performance.
- Exception: Entity recognition can exceed human performance.
- Classifier performance is limited by the algorithms available, which are limited by development effort.
- Very wide variance in one vendor’s performance depending on who does the implementation, and how much time they have to do it.
1) 80/20 tradeoff where 20% of effort gives 80% of performance.
2) Smart implementation of inexpensive tools will outperform naive implementations of world-class tools.

Tagging tool example: Interwoven MetaTagger
- Manual form fill-in w/ check boxes, pull-down lists, etc.
- Auto keyword & summarization
- Auto-categorization
- Rules & pattern matching
- Parse & lookup (recognize names)

Metadata tagging workflows
- Even ‘purely’ automatic meta-tagging systems need a manual error correction procedure.
- Should add a QA sampling mechanism.
- Tagging models: Author-generated; Central librarians; Hybrid – central auto-tagging service, distributed manual review and correction.
[Workflow diagram. Steps: Compose in Template; Automatically fill in metadata; Submit to CMS; Approve/Edit metadata; Review content; Copy Edit content. Decision points: Problem? (Y/N). Outputs: Hard Copy, Web site. Lanes: Tagging Tool, Analyst, Editor, Copywriter, Sys Admin.]
Sample of ‘author-generated’ metadata workflow.

Automatic categorization vendors | Pragmatic viewpoint
[Chart. Axes: Content Volumes (low to high) and Accuracy Level (low to high).]

Seven practical rules for taxonomies
1. Incremental, extensible process that identifies and enables users, and engages stakeholders.
2. Quick implementation that provides measurable results as quickly as possible.
3. Not monolithic: has separately maintainable facets.
4. Re-uses existing IP as much as possible.
5. A means to an end, and not the end in itself.
6. Not perfect, but it does the job it is supposed to do, such as improving search and navigation.
7. Improved over time, and maintained.

Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Summary, Q&A
6:45 Adjourn

Summary: Categorize with a purpose
What is the problem you are trying to solve?
- Improve search
- Browse for content on an enterprise-wide portal
- Enable business users to syndicate content
- Otherwise provide the basis for content re-use
How will you control the cost of creating and maintaining the metadata needed to solve these problems?
- CMS with a metadata tagging product
- Semi-automated classification
- Taxonomy editing tools
- Guided navigation tools

Taxonomy Strategies LLC Contact Info
Ron Daniel, 925-368-8371, rdaniel@taxonomystrategies.com
Joseph Busch, 415-377-7912, jbusch@taxonomystrategies.com
May 22, 2005
Copyright 2005 Taxonomy Strategies LLC. All rights reserved.