Workshop: Why and How to Use Dublin Core for Enterprise-Wide Metadata Applications

Ron Daniel & Joseph Busch
Taxonomy Strategies LLC
May 22, 2005
Copyright 2005 Taxonomy Strategies LLC. All rights reserved.
Workshop goals
1. What is the Dublin Core?
2. Answer these enterprise-wide metadata ROI questions:
   What is the value proposition for adding metadata to content? Does metadata make content reusable? Findable? Improve productivity? How can metadata value be measured in a way that quantifies how it contributes to the bottom line?
3. Answer these business process questions:
   How is Dublin Core tagging being done on content to expose metadata to portals, search engines, and other metadata-aware applications? How are metadata value spaces (controlled vocabularies) maintained within an enterprise? Across enterprises?
4. Answer these technology questions:
   What tools exist to use Dublin Core and other metadata standards in enterprise information management environments?
TAXONOMY STRATEGIES LLC The business of organized information
Agenda
3:30  Introductions: Us and you
3:45  Background: Metadata & controlled vocabularies
4:00  Dublin Core: Elements, issues, and recommendations
4:30  Dublin Core in the wild: CEN study and remarks
4:45  Enterprise-wide metadata ROI questions
5:00  Break
5:15  ROI (Cont.)
5:30  Business processes
6:15  Tools & technologies
6:30  Q&A
6:45  Adjourn
Who we are: Joseph Busch
Over 25 years in the business of organized information
• Founder, Taxonomy Strategies
• Director, Solutions Architecture, Interwoven
• VP, Infoware, Metacode Technologies (acquired by Interwoven, November 2000)
• Program Manager, Getty Foundation
• Manager, Pricewaterhouse
Metadata and taxonomies community leadership
• President, American Society for Information Science & Technology
• Director, Dublin Core Metadata Initiative
• Adviser, National Research Council Computer Science and Telecommunications Board
• Reviewer, National Science Foundation Division of Information and Intelligent Systems
• Founder, Networked Knowledge Organization Systems/Services
Who we are: Ron Daniel, Jr.
Over 15 years in the business of metadata & automatic classification
• Principal, Taxonomy Strategies
• Standards Architect, Interwoven
• Senior Information Scientist, Metacode Technologies (acquired by Interwoven, November 2000)
• Technical Staff Member, Los Alamos National Laboratory
Metadata and taxonomies community leadership
• Chair, PRISM (Publishers Requirements for Industry Standard Metadata) working group
• Acting chair: XML Linking working group
• Member: RDF working groups
• Co-editor: PRISM, XPointer, 3 IETF RFCs, and Dublin Core 1 & 2 reports.
Recent & current projects
Government
• Commodity Futures Trading Commission
• Defense Intelligence Agency
• ERIC
• Federal Aviation Administration
• Federal Reserve Bank of Atlanta
• Forest Service
• GSA Office of Citizen Services (www.firstgov.gov)
• Head Start
• Infocomm Development Authority of Singapore
• NASA (nasataxonomy.jpl.nasa.gov)
• Small Business Administration
• Social Security Administration
• USDA Economic Research Service
• USDA e-Government Program (www.usda.gov)
Commercial
• Allstate Insurance
• Blue Shield of California
• Debevoise & Plimpton
• Halliburton
• Hewlett Packard
• Motorola
• PeopleSoft
• Pricewaterhouse Coopers
• Siderean Software
• Sprint
• Time Inc.
Commercial subcontracts
• Agency.com – Top financial services
• Critical Mass – Fortune 50 retailer
• Deloitte Consulting – Big credit card
• Gistics/OTB – Direct selling giant
NGOs
• CEN
• IDEAlliance
• IMF
• OCLC
What we do
Organize Stuff
Who are you? Tell us:
• Your name
• Your organization
• Your job title
• The things you want to get from this workshop
Metadata: Different definitions
• Library & Information Science
  – Author/Title/Subject
  – Controlled Vocabularies for Subject Codes (e.g. Dewey)
  – Authority Files for Author Names
• Database
  – Tables/Columns/Datatypes/Relationships
  – References for some values
Metadata: Why it matters
• “Adding metadata to unstructured content allows it to be managed like structured content. Applications that use structured content work better.”
• “Enriching content with structured metadata is critical for supporting search and personalized content delivery.”
• “Content that has been adequately tagged with metadata can be leveraged in usage tracking, personalization and improved searching.”
• “Better structure equals better access: Taxonomy serves as a framework for organizing the ever-growing and changing information within a company. The many dimensions of taxonomy can greatly facilitate Web site design, content management, and search engineering. If well done, taxonomy will allow for structured Web content, leading to improved information access.”
Metadata: Supports core functions
[Figure: metadata groups plotted by Complexity against Enabled Functionality]
• Asset metadata – Who: Creator, Publisher, Contributor
• Subject metadata – What, Where & Why: Subject, Title, Description, Coverage
• Use metadata – When & How: Date, Language, Rights
• Relational metadata – Links between and to: Type, Format, Source, Relation, Identifier
Enabled functionality: more efficient editorial process; better navigation & discovery
http://dublincore.org/documents/dces/
What is a taxonomy? Systematics view
Hierarchical classification of things into a tree structure
Linnaeus …
  Kingdom: Animalia
  Phylum: Chordata
  Class: Mammalia
  Order: Carnivora
  Family: Canidae
  Genus: Canis
  Species: C. familiaris
UNSPSC …
  Segment: 44-Office Equipment and Accessories and Supplies
  Family: .12-Office Supplies
  Class: .17-Writing Instruments
  Commodity: .05-Mechanical pencils, .06-Wooden pencils, .07-Colored pencils
Dublin Core: A little more complicated
Elements (15): Identifier, Title, Creator, Contributor, Publisher, Subject, Description, Coverage, Format, Type, Date, Relation, Source, Rights, Language
Refinements: Abstract, Access rights, Alternative, Audience, Available, Bibliographic citation, Conforms to, Created, Date accepted, Date copyrighted, Date submitted, Education level, Extent, Has format, Has part, Has version, Is format of, Is part of, Is referenced by, Is replaced by, Is required by, Issued, Is version of, License, Mediator, Medium, Modified, Provenance, References, Replaces, Requires, Rights holder, Spatial, Table of contents, Temporal, Valid
Encodings: Box, DCMIType, DDC, IMT, ISO3166, ISO639-2, LCC, LCSH, MESH, Period, Point, RFC1766, RFC3066, TGN, UDC, URI, W3CTDF
Types: Collection, Dataset, Event, Image, Interactive Resource, Moving Image, Physical Object, Service, Software, Sound, Still Image, Text
Dublin Core framework for corporate use
• Not just 15 elements
• A framework to enable cross-resource exploration and use
Dublin Core is a framework for “integration metadata” at BellSouth
Source: Todd Stephens, BellSouth
Metadata: A data specification – a recipe example
Element | Data Type | Length | Req./Repeat | Source | Purpose | Dublin Core
Asset Metadata
Unique ID | Integer | Fixed | 1 | System supplied | Basic accountability | dc:identifier
Recipe Title | String | Variable | 1 | Licensed Content | Text search & results display | dc:title
Recipe summary | String | Variable | 1 | Licensed Content | Content | dc:description
Subject Metadata
Main Ingredients | List | Variable | ? | Main Ingredients vocabulary | Key index to retrieve & aggregate recipes, & generate shopping list | X
Meal Types | List | Variable | * | Meal Types vocab | Browse or group recipes & filter search results | X
Cuisines | List | Variable | * | Cuisines | Browse or group recipes & filter search results | X
Courses | List | Variable | * | Courses vocab | Browse or group recipes & filter search results | X
Cooking Method | Flag | Fixed | * | Cooking vocab | Browse or group recipes & filter search results | X
Link Metadata
Recipe Image | Pointer | Variable | ? | Product Group | Merchandize products | dcterms:hasPart
Use Metadata
Rating | String | Variable | 1 | Licensed Content | Filter, rank, & evaluate recipes |
Release Date | Date | Fixed | 1 | Product Group | Publish & feature new recipes | dc:date
Constant values: dc:type=“recipe”, dc:format=“text/html”, dc:language=“en”
Legend: ? – 1 or more; * – 0 or more
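A specification like the recipe example can be checked mechanically. The following is a minimal sketch of one way to do that in Python, assuming each record is held as a dict that maps an element name to a list of values. The `SPEC` table, the `validate` function, the non-Dublin-Core key names, and the sample record are all hypothetical illustrations; only the cardinality codes come from the slide's legend (`1` for exactly one, `?` for 1 or more, `*` for 0 or more).

```python
# Cardinality per element, following the slide's legend.
# Key names without a dc:/dcterms: prefix are illustrative assumptions.
SPEC = {
    "dc:identifier": "1",
    "dc:title": "1",
    "dc:description": "1",
    "main_ingredients": "?",
    "meal_types": "*",
    "cuisines": "*",
    "courses": "*",
    "cooking_method": "*",
    "dcterms:hasPart": "?",
    "rating": "1",
    "dc:date": "1",
}


def validate(record):
    """Return a list of (element, problem) pairs; an empty list means the record conforms."""
    problems = []
    for element, rule in SPEC.items():
        count = len(record.get(element, []))
        if rule == "1" and count != 1:
            problems.append((element, f"expected exactly 1 value, found {count}"))
        elif rule == "?" and count < 1:
            problems.append((element, "expected 1 or more values, found 0"))
        # "*" allows any number of values, so there is nothing to check.
    for element in record:
        if element not in SPEC:
            problems.append((element, "not in the specification"))
    return problems


# Hypothetical sample record.
recipe = {
    "dc:identifier": ["10452"],
    "dc:title": ["Lemon Tart"],
    "dc:description": ["A sharp, buttery tart."],
    "main_ingredients": ["lemon", "butter"],
    "cuisines": ["French"],
    "dcterms:hasPart": ["images/lemon-tart.jpg"],
    "rating": ["4"],
    "dc:date": ["2005-05-22"],
}

print(validate(recipe))
```

Keeping the rules in a data table rather than in code means the specification can be reviewed by non-programmers and extended without touching the checker.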
Why Dublin Core?
Dublin Core is a de facto standard across many other systems and standards
• RSS (1.0), OAI
• Inside organizations – portals, CMS, …
Mapping to DC elements from most existing schemes is simple
• Beware of force-fits
Why will metadata already exist?
• Because of search projects, portal integration projects, etc. that are creating it or standardizing a mapping.
[Figure: layered diagram – Taxonomies, Vocabularies, Ontologies; Dublin Core and Similar; Per-Source Data Types, Access Controls, etc. Source: Todd Stephens, BellSouth]
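Mapping an existing scheme to Dublin Core can be as small as a lookup table. The sketch below is a hypothetical crosswalk in Python: the source field names (`headline`, `byline`, and so on) are invented for illustration and do not come from any particular CMS. Fields with no natural Dublin Core home are returned separately instead of being force-fitted into an element.

```python
# Hypothetical crosswalk from an in-house scheme to Dublin Core elements.
CROSSWALK = {
    "headline": "dc:title",
    "byline": "dc:creator",
    "summary": "dc:description",
    "pub_date": "dc:date",
    "keywords": "dc:subject",
}


def to_dublin_core(source_record):
    """Split a source record into mapped DC elements and unmapped leftovers."""
    mapped, unmapped = {}, {}
    for field, value in source_record.items():
        target = CROSSWALK.get(field)
        if target is None:
            # No sensible DC element: keep the field aside rather than force-fit it.
            unmapped[field] = value
        else:
            mapped.setdefault(target, []).append(value)
    return mapped, unmapped


mapped, unmapped = to_dublin_core(
    {"headline": "Q1 Results", "byline": "J. Smith", "workflow_state": "approved"}
)
print(mapped)
print(unmapped)
```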
Creator
“An entity primarily responsible for making the content of the resource”
In other words – Author, Photographer, Illustrator, …
• Potential refinements by creative role
• Rarely justified
Creators can be persons or organizations
Key Point – Reminder: Name variations are a big issue in data quality:
• Ron Daniel
• Ron Daniel, Jr.
• Ron Daniel Jr.
• R.E. Daniel
• Ronald Daniel
• Ronald Ellison Daniel, Jr.
• Daniel, R.
Name fields may contain other information
• <dc:creator>Case, W. R. (NASA Goddard Space Flight Center, Greenbelt, MD, United States)</dc:creator>
Best practice – Validate names against LDAP or other “Authority File”
Refinements: None
Encodings: None
Example – Name mismatches
One of these things is not like the other:
• Ron Daniel, Jr. and Carl Lagoze; “Distributed Active Relationships in the Warwick Framework”
• Hojung Cha and Ron Daniel; “Simulated Behavior of Large Scale SCI Rings and Tori”
• Ron Daniel; “High Performance Haptic and Teleoperative Interfaces”
Differences may not matter
If they do
• This error cannot be reliably detected automatically
• Authority files and an error-correction procedure are needed
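An authority file lookup can be sketched in a few lines. The example below is an assumption-laden illustration in Python, not a description of any LDAP product: the in-memory `AUTHORITY` dict stands in for a real directory, the preferred label `Daniel, Ron, Jr.` is invented, and `normalize` is a deliberately crude matching key. Names that do not resolve are returned as `None` so that a person can review them, since a lookup cannot tell whether two people share a name.

```python
import re


def normalize(name):
    """Crude matching key: lowercase, punctuation removed, whitespace collapsed."""
    return " ".join(re.sub(r"[^\w\s]", " ", name.lower()).split())


# Hypothetical authority file: each known variant points at one preferred label.
PREFERRED = "Daniel, Ron, Jr."
VARIANTS = [
    "Ron Daniel",
    "Ron Daniel, Jr.",
    "R.E. Daniel",
    "Ronald Daniel",
    "Ronald Ellison Daniel, Jr.",
    "Daniel, R.",
]
AUTHORITY = {normalize(v): PREFERRED for v in VARIANTS}


def resolve_creator(name):
    """Return the preferred label, or None when the name needs manual review."""
    return AUTHORITY.get(normalize(name))


for candidate in ["Ron Daniel Jr.", "R. E. Daniel", "Hojung Cha"]:
    print(candidate, "->", resolve_creator(candidate))
```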
Contributor
“An entity responsible for making contributions to the content of the resource.”
In practice – rarely used.
• Difficult to distinguish from Creator.
• Adds UI complexity for no real gain
Best Practice? Recommendation – Don’t use.
Refinements: None
Encodings: None
Publisher
“An entity responsible for making the resource available”.
Problems:
• All the name-handling stuff of Creator.
• Hierarchy of publishers (Bureau, Agency, Department, …)
Refinements: None
Encodings: None
Title
“A name given to the resource”.
Issues:
• Hierarchical titles, e.g. Conceptual Structures: Information Processing in Mind and Machine (The Systems Programming Series)
• Untitled works
• Metaphysics
Refinements: Alternative
Encodings: None
Identifier
“An unambiguous reference to the resource within a given context”
Best Practice: URL
Future Best Practice: URI?
Problems
• Metaphysics
• Personalized URLs
• Multiple identifiers for same content
• Non-standard resolution mechanisms for URIs
Recommendations – Plan how to introduce long-lived URLs
Refinements: Bibliographic Citation
Encodings: URI
Date
“A date associated with an event in the life cycle of the resource”
Woefully underspecified. Typically the publication or last modification date.
Best practice: YYYY-MM-DD
Refinements: Created, Valid, Available, Issued, Modified, Date Accepted, Date Copyrighted, Date Submitted
Encodings: DCMI Period, W3C DTF (Profile of ISO 8601)
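The YYYY-MM-DD practice is easy to enforce at tagging time. The sketch below is a minimal Python check, assuming only complete dates are wanted (W3C DTF also allows year-only, year-month, and date-time forms, which this deliberately rejects). The function name `parse_dc_date` is hypothetical.

```python
import re
from datetime import date

# Exactly four-digit year, two-digit month, two-digit day.
_FULL_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def parse_dc_date(value):
    """Return a datetime.date for a valid YYYY-MM-DD string, else None."""
    match = _FULL_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Well-formed but impossible, for example 2005-02-30.
        return None


for raw in ["2005-05-22", "05/22/2005", "2005-02-30"]:
    print(raw, "->", parse_dc_date(raw))
```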
Subject
The topic of the content of the resource.
Best practice: Use pre-defined subject schemes, not user-selected keywords.
• Supported encodings probably not useful for most corporate needs
Factor “Subject” into separate facets.
• People, places, organizations, events, objects, services
• Industry sectors
• Content types, audiences, functions
• Topic
Some of the facets are already defined in DC (Coverage, Type) or DCTERMS (Audience)
Refinements: None
Encodings: DDC, LCC, LCSH, MESH, UDC
Coverage
“The extent or scope of the content of the resource”.
In other words – places and times as topics.
Key Point – Locations important in SOME environments, irrelevant in others. Time periods as subjects rarely important in commercial work.
Best Practice – ISO 3166-1, 3166-2
Refinements: Spatial, Temporal
Encodings: Box (for Spatial), ISO3166 (for Spatial), Point (for Spatial), TGN (for Spatial), W3CTDF (for Temporal)
Description
“An account of the content of the resource”.
In other words – an abstract or summary
Key Point – What’s the cost/benefit tradeoff for creating descriptions?
Quality of auto-generated descriptions is low
For search results, hit highlighting is probably better
Refinements: Abstract, Table of Contents
Encodings: None
Type
“The nature or genre of the content of the resource”
Best Current Practice: Create a custom list of content types, and use that list for the values.
Try to avoid “image”, “audio”, and other format names in the list of content types; they can be derived from “Format”.
No broadly acceptable list yet found.
Refinements: None
Encodings: DCMI Type
Format
“The physical or digital manifestation of the resource.”
In other words – the file format
Best practice: Internet Media Types
Outliers: File sizes, dimensions of physical objects
Refinements: Extent, Medium
Encodings: IMT
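Internet Media Types can usually be derived from the file name instead of being typed in by hand. A minimal sketch using Python's standard `mimetypes` module follows; the wrapper name `dc_format` is hypothetical, and guessing from the extension is an assumption that will be wrong for misnamed files.

```python
import mimetypes


def dc_format(filename):
    """Guess an Internet Media Type for dc:format from a file name, or None if unknown."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type


for name in ["recipe.html", "report.pdf", "README"]:
    print(name, "->", dc_format(name))
```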
Language
“A language of the intellectual content of the resource”.
Best Practice: ISO 639, RFC 3066
Dialect codes: Advanced practice
Refinements: None
Encodings: ISO639-2, RFC1766, RFC3066
Relation
“A reference to a related resource”
Very weak meaning – not even as strong as “See also”.
Best practice: Use a refinement element and URLs.
Refinements: Is Version Of, Has Version, Is Replaced By, Replaces, Is Required By, Requires, Is Part Of, Has Part, Is Referenced By, References, Is Format Of, Has Format, Conforms To
Encodings: URI
Source
“A reference to a resource from which the present resource is derived”
Original intent was for derivative works
Frequently abused to provide bibliographic information for items extracted from a larger work, such as articles from a journal
Refinements: None
Encodings: URI
Rights
“Information about rights held in and over the resource”
Could be a copyright statement, or a list of groups with access rights, or …
Refinements: Access Rights, License
Encodings: None
CEN/ISSS Workshop on Dublin Core. Guidance information for the deployment of Dublin Core metadata in Corporate Environments
http://www.cenorm.be/cenorm/businessdomains/businessdomains/isss/cwa/cwa15247.asp
Dublin Core: CEN/ISSS Workshop on Dublin Core Metadata – corporate uses
• Applied Information Technique
• AstraZeneca
• BBC
• BellSouth
• Cisco
• DaimlerChrysler
• Giunti Labs
• GSK
• Halliburton
• HP
• IBM
• Intel
• John Wiley & Sons
• Lilly
• PeopleSoft
• Rohm and Haas
• SAP
• Software AG
• Unisys
How is Dublin Core used in corporate environments?
[Bar chart, vertical axis 0% to 60%. Categories: De facto, Simple, Access enabler, Compliance. Bar values shown: 57%, 43%, 43%, 29%.]
Base: 20 corporate information managers
Source: CEN/ISSS Workshop on Dublin Core – Guidance information for the deployment of Dublin Core metadata in Corporate Environments
Taxonomy: e-Forms example
Facets and their controlled vocabularies
Agency: 0001 Legislative; 1000 Judicial; 1100 Executive; Office of Pres; 0003 Exec Depts; 1200 Agriculture; 1300 Commerce; 9700 Defense; 9100 Education; 8900 Energy; 7500 HHS; 7000 DHS; 8600 HUD; 1400 Interior; 1500 Justice; 1600 Labor; 1900 State; 6900 Transport; 2000 Treasury; 3600 Veterans; Ind Agencies; Intl Orgs
Form Type: Application; Approval; Claim; Information request; Information submission; Instructions; Legal filing; Payment; Procurement; Renewal; Reservation; Service request; Test; Other input; Other transaction
Industry Impact: 00 Generic; 11 Agriculture; 21 Mining; 22 Utilities; 23 Construct; 31-33 Manuf; 42 Wholesale; 44-45 Retail; 48-49 Trans; 51 Info; 52 Finance; 54 Profession; 55 Mgmt; 56 Support; 61 Education; 62 Health Care; 71 Arts; 72 Hospitality; 81 Other Services; 92 Public Admin
Jurisdiction: Federal; State +; Local +; Other +
BRM Impact: Citizen Srvcs; Social Srvs; Defense; Disasters; Econ Dev; Education; Energy; Env Mgmt; Law Enf; Judicial; Correctional; Health; Security; Income Sec; Intelligence; Intl Affairs; Nat Resour; Transport; Workforce; Science; Delivery; Support; Management
Keyword Topic: Agriculture & food; Commerce; Communications; Education; Energy; Env pro; Foreign rels; Govt; Health & safety; Housing & comm dev; Labor; Law; Named grps; National def; Nat resources; Recreation; Sci & tech; Social pgms; Transport
Audience: All; General; Citizen; Business; Govt; Employee; Native American; Nonresident; Tourist; Special group
How is Dublin Core extended?
[Bar chart, vertical axis 0% to 120%. Categories: Roles, Inconsistent Encoding, Doc Types, Products & Services. Bar values shown: 100%, 86%, 57%, 57%.]
Base: 20 corporate information managers
Source: CEN/ISSS Workshop on Dublin Core – Guidance information for the deployment of Dublin Core metadata in Corporate Environments
Custom business process document types? Ouch!
Oil & gas services company document types
analysis, appraisals, assessments, forecasts, predictions
agendas, plans, designs, schedules, workflow
applications, proposals, requests, requirements
permits, consents, approvals, rejections, certificates
work orders, correspondence
auditing, compliance, testing, inspections, operations reports
lessons learned, after-action reviews, meeting minutes, FAQs
policies, procedures, training manuals, standards, best practices
research notes, journal articles
newsletters, bulletins, press releases
ads, brochures, data sheets, technical notes, case studies, price lists
checklists, templates, forms, logos, branding
software, database forms
The power of taxonomy facets
• 4 independent categories of 10 nodes each have the same discriminatory power as one hierarchy of 10,000 nodes (10^4)
• Easier to maintain
• Can be easier to navigate
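The facet arithmetic can be confirmed by enumeration. This short Python sketch builds four hypothetical facets of ten placeholder values each (the facet names and values are invented) and counts both the nodes that must be maintained and the distinct combinations they can express.

```python
from itertools import product

# Four hypothetical facets, ten placeholder values each.
facets = {
    name: [f"{name}-{i}" for i in range(10)]
    for name in ("agency", "form_type", "jurisdiction", "audience")
}

# Nodes an editor has to maintain: 4 facets x 10 values.
nodes_to_maintain = sum(len(values) for values in facets.values())

# Distinct classifications: one value chosen from each facet.
combinations = sum(1 for _ in product(*facets.values()))

print(nodes_to_maintain, combinations)
```

Forty maintained nodes distinguish as many items as a single 10,000-node hierarchy would, which is the maintenance argument the slide makes.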
Taxonomic metadata example:
Form SS-4. Employer Identification Number (EIN)
Facet | Values
Agency | IRS
Content Type | Information Submission
Industry Impact | Generic
Jurisdiction | Federal
Programs & Services | Support Delivery of Services/General Government/Taxation Management
Keyword Topic | Commerce/Employment taxes
Audience | Business
Fundamentals of metadata ROI
• Tagging content using metadata and a taxonomy is a cost, not a benefit.
• There is no benefit without exposing the tagged content to users in some way that cuts costs or improves revenues.
• Putting metadata and a taxonomy into operation requires UI changes and/or backend system changes, as well as data changes.
• You need to determine those changes, and their costs, as part of the ROI.
Common metadata ROI scenarios
• Catalog site
  – Increased sales.
  – Increased productivity.
• Customer support
  – Cutting costs.
  – Increased sales.
• Compliance
  – Avoiding penalties.
• Knowledge worker productivity
  – Less time searching, more time working.
• Executive Mandate
  – No ROI study, just someone with a vision and a budget.
Metadata ROI: Catalog site
Guided Navigation
• 2-3 clicks to product
• No dead ends
http://www.tesco.com/winestore
Metadata ROI: Catalog site
• Enterprise portal cost
  – $6M
• Increased sales
  – Product findability.
  – Product cross-sells and up-sells.
  – Customer loyalty.
• 1-5% increase in sales
  – $57.6B sales (’04): $600M to $2B/year
  – $2.1B net income (’04): $21M to $105M/year
• 1-5% increase in productivity
  – $50K average cost per employee
  – 310,400 employees (’04): $155M to $776M/year
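The ranges in the catalog site model are simple percentages of the quoted base figures. A minimal sketch of that arithmetic in Python follows, using whole dollars; the helper `pct_range` and the variable names are illustrative and not part of the workshop material. The straight calculation for the sales line gives $576M to $2.88B, so the slide's $600M to $2B appears to be a rounded figure.

```python
def pct_range(base, low_pct, high_pct):
    """Return (low, high) as whole-dollar percentages of a base amount."""
    return base * low_pct // 100, base * high_pct // 100


# Base figures quoted on the slide ('04).
sales = 57_600_000_000
net_income = 2_100_000_000
employees = 310_400
cost_per_employee = 50_000

sales_gain = pct_range(sales, 1, 5)
income_gain = pct_range(net_income, 1, 5)
productivity_gain = pct_range(employees * cost_per_employee, 1, 5)

for label, (low, high) in [
    ("1-5% of sales", sales_gain),
    ("1-5% of net income", income_gain),
    ("1-5% of employee cost", productivity_gain),
]:
    print(f"{label}: ${low / 1e6:,.1f}M to ${high / 1e6:,.1f}M per year")
```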
Metadata ROI: Customer support model
[Screenshot callouts]
• Help on search page, not a click away.
• Type and go to search for specific policies
• Policy categories for browsing
• Refine search offered with results
• Good search results for policy topics, e.g., “pets”
Metadata ROI: Customer support model
• Self service
  – Fewer customer calls.
  – Faster, more accurate CSR responses through better information access.
• Manual processing
  – 100,000 documents
  – 2 pages per document
  – $4 per page
  – $800K
• 25-50% service efficiency increase
  – 300K customer service calls per month
  – $6 cost per call
  – $5.4M to $10.8M/yr
• 1-5% increased sales
  – $18.6B sales (’04): $186M to $930M/year
  – ($761M) net income (’04): ($575M) to $169M/year
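The customer support figures can be reproduced with a few lines of arithmetic. The Python sketch below uses only the base numbers quoted on the slide; the variable names are illustrative. The net income range is obtained by adding the whole sales gain to the reported loss, which is an assumption about how the slide's figures were derived, not a statement about real margins.

```python
# One-time tagging cost: documents x pages per document x cost per page.
tagging_cost = 100_000 * 2 * 4

# Annual call-centre cost: calls per month x cost per call x 12 months.
annual_call_cost = 300_000 * 6 * 12
call_savings = (annual_call_cost * 25 // 100, annual_call_cost * 50 // 100)

# 1-5% of '04 sales.
sales = 18_600_000_000
sales_gain = (sales * 1 // 100, sales * 5 // 100)

# Assumption: the full sales gain is added to the reported net loss.
net_income = -761_000_000
adjusted_income = (net_income + sales_gain[0], net_income + sales_gain[1])

print("tagging cost:", tagging_cost)
print("call savings per year:", call_savings)
print("sales gain per year:", sales_gain)
print("adjusted net income:", adjusted_income)
```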
Metadata ROI: Compliance
• Avoiding penalties for breaching regulations
  – SOX: up to 5 years in jail
  – SOX: up to $5M
• Following required procedures
• Loss of company
  – $100B revenue (’00): $100B
• Loss of partner companies
  – Arthur Andersen
Knowledge workers spend up to 2.5 hours each day looking for information … but find what they are looking for only 40% of the time.
[Pie chart segments: Communicating, Searching, Creating]
– Kit Sims Taylor
High cost of not finding information
• “The amount of time wasted in futile searching for vital information is enormous, leading to staggering costs …” – Sue Feldman
High cost of poor classification
• Poor classification costs a 10,000 user organization $10M each year, about $1,000 per employee. – Jakob Nielsen, useit.com
But “better search” itself is a weak ROI
Knowledge workers spend more time re-creating existing content than creating new content
[Pie chart segments: Communicating; Searching; Recreating existing content, 26%; Creating new content, 9%]
– Kit Sims Taylor
Metadata ROI: Productivity
• Enterprise document management system cost
  – $10M
• Decreased cost to market
  – Decreased development cost
  – Increased R&D productivity
  – Reduced time for sales & marketing
• 1-5% decrease in drug development cost
  – $800M/drug: $8M to $16M/drug
• 5-10% increase in R&D productivity
  – 13% of revenue
  – $39B in sales (’04): $254M to $507M/year
• 10-20% decrease in time for sales & marketing
  – 13% of revenue: $254M to $507M/year
Metadata FAQ: Executive mandate is key
• There is no ROI out of the box
  – Just someone with a vision … and the budget to make it happen.
• What’s really needed?
  – Demos and proofs of value.
  – So that a stronger cost benefit argument can be made for continuing the work
Metadata FAQ: How do you sell it?
• Don’t sell “metadata” or “taxonomy”; sell the vision of what you want to be able to do.
• Clearly understand what the problem is and what the opportunities are.
• Do the calculus (costs and benefits)
• Design the taxonomy (in terms of level of effort, LOE) in relation to the value at hand.
Overview of metadata practices
• Identify the team
• Use (or map to) Dublin Core for basic information.
• Extend with custom elements for specific facts.
• Use pre-existing, standard vocabularies as much as possible.
  – ISO country codes for locations
  – Product & service info from ERP system
  – Validate author names with LDAP directory
• Design a QC Process
  – Start with an error-correction process, then get more formal on error detection
  – Large-scale ontologies may be valuable in automated error detection
Factor “Subject” into smaller facets
• Size
  – DMOZ tries to organize all web content, has more than 600k categories!
• Difficulty in navigating, maintaining
• Hidden facet structure
• “Classification Schemes” vs. “Taxonomies”
Sources for 7 common vocabularies
Vocabulary | Dublin Core | Definition | Potential Sources
Organization | dc:publisher | Organizational structure. | FIPS 95-2, U.S. Government Manual, your organizational structure, etc.
Content Type | dc:type | Structured list of the various types of content being managed or used. | DC Types, AGLS Document Type, AAT Information Forms, records management policy, etc.
Industry | | Broad market categories such as lines of business, life events, or industry codes. | FIPS 66, SIC, NAICS, etc.
Location | dc:coverage | Place of operations or constituencies. | FIPS 5-2, FIPS 55-3, ISO 3166, UN Statistics Div, US Postal Service, etc.
Function | | Functions and processes performed to accomplish mission and goals. | FEA Business Reference Model, Enterprise Ontology, AAT Functions, etc.
Topic | dc:subject | Business topics relevant to your mission and goals. | Federal Register Thesaurus, NAL Agricultural Thesaurus, LCSH, etc.
Audience | dcterms:audience | Subset of constituents to whom a piece of content is directed or intended to be used. | GEM, ERIC Thesaurus, IEEE LOM, etc.
Products and Services | | Names of products/programs & services. | ERP system, your products and services, etc.
Cheap and Easy Metadata
• Some fields will be constant across a collection.
• In the context of a single collection those kinds of elements add no value, but they add tremendous value when many collections are brought together into one place, and they are cheap to create and validate.
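Collection-wide constants can be stamped onto every record at export time, so nobody has to tag them by hand. The Python sketch below shows one way this might look; `COLLECTION_DEFAULTS`, `apply_defaults`, and the sample values are hypothetical, and the choice to let a record's own values override the defaults is a design assumption.

```python
# Hypothetical constants for one collection of HTML recipes.
COLLECTION_DEFAULTS = {
    "dc:type": ["recipe"],
    "dc:format": ["text/html"],
    "dc:language": ["en"],
}


def apply_defaults(record, defaults):
    """Return a new record with collection defaults filled in; record values win."""
    merged = {element: list(values) for element, values in defaults.items()}
    merged.update({element: list(values) for element, values in record.items()})
    return merged


tagged = apply_defaults({"dc:title": ["Lemon Tart"]}, COLLECTION_DEFAULTS)
print(tagged)
```

When several collections are later merged, these stamped fields are what let a search or portal filter by type, format, or language across all of them.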
Taxonomy Business Processes
• Taxonomies must change, gradually, over time if they are to remain relevant
• Maintenance processes need to be specified so that the changes are based on rational cost/benefit decisions
• A team will need to maintain the taxonomy on a part-time basis
• Taxonomy team reports to some other steering committee
Definitions about the Controlled Vocabulary Governance Environment
[Figure: Controlled Vocabulary Governance Environment]
1: Syndicated Terminologies change on their own schedule
2: CV Team decides when to update CVs
3: Team adds value via mappings, translations, synonyms, training materials, etc.
4: Updated versions of CVs published to consuming applications
Diagram components:
• Syndicated Terminologies: ISO 3166-1, Other External
• Vocabulary Management System: CVs, Other Controlled Items, Custodians
• Flows: Change Requests & Responses, Notifications, Published CVs and STs
• Consuming Applications: Web CMS, Archives, Intranet Search, ERMS, ERP, Intranet Nav., DAM, Other Internal, …
Other Controlled Items
Taxonomy Team will have additional items to manage:
• Charter, Goals, Performance Measures
• Editorial rules
• Team processes
• Tagger training materials (manual and automatic)
• Outreach & ROI
  – Communication plan
  – Website
  – Presentations
  – Announcements
• Roadmap
Taxonomy governance | Generic team charter
Taxonomy Team is responsible for maintaining:
• The Taxonomy, a multi-faceted classification scheme
• Associated taxonomy materials, such as:
  – Editorial Style Guide
  – Taxonomy Training Materials
  – Metadata Standard
  – Team rules and procedures (subject to CIO review)
Team evaluates costs and benefits of suggested change
Taxonomy Team will:
• Manage relationship between providers of source vocabularies and consumers of the Taxonomy
• Identify new opportunities for use of the Taxonomy across the Enterprise to improve information management practices
• Promote awareness and use of the Taxonomy
Other Controlled Items - Editorial Rules
To ensure consistent style, rules are needed
Issues commonly addressed in the rules:
• Sources of Terms
• Abbreviations
• Ampersands
• Capitalization
• Continuations (More… or Other…)
• Duplicate Terms
• Hierarchy and Polyhierarchy
• Languages and Character Sets
• Length Limits
• “Other” – Allowed or Forbidden?
• Plural vs. Singular Forms
• Relation Types and Limits
• Scope Notes
• Serial Comma
• Spaces
• Synonyms and Acronyms
• Term Arrangement (Alphabetic or …)
• Term Label Order (Direct vs. Inverted)
Must also address issue of what to do when rules conflict – which are more important?
Rule Name | Editorial Rule
Use Existing Vocabularies | Other things being equal, reusing an existing vocabulary is preferred to creating a new one.
Ampersands | The character ‘&’ is preferred to the word ‘and’ in Term Labels. Example: Use Type: “Manuals & Forms”, not “Manuals and Forms”.
Special Characters | Retain accented characters in Term Labels. Example: España
Serial comma | If a category name includes more than two items, separate the items by commas. The last item is separated by the character ‘&’, which IS NOT preceded by a comma. Example: “Education, Learning & Employment”, not “Education, Learning, & Employment”.
Capitalization | Use title case (where all words except articles are capitalized). Example: “Education, Learning & Employment”, NOT “Education, learning & employment”, NOT “EDUCATION, LEARNING & EMPLOYMENT”, NOT “education, learning & employment”.
… | …
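Several of the editorial rules above are mechanical enough to check automatically. The sketch below is an illustration only and not part of the workshop material: the function name `check_term_label` and its messages are assumptions, it covers just the Ampersands, Serial comma, and Capitalization rules, and it would wrongly flag legitimate all-caps acronyms.

```python
import re

# Articles are the only words the Capitalization rule exempts from title case.
ARTICLES = {"a", "an", "the"}

# A "word" is a run of letters, optionally joined by hyphens or apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def check_term_label(label: str) -> list[str]:
    """Return a list of editorial-rule violations for one term label.

    Hypothetical helper covering three rules only.
    """
    problems = []

    # Ampersands: '&' is preferred to the word 'and'.
    if re.search(r"\band\b", label, flags=re.IGNORECASE):
        problems.append("use '&' instead of 'and'")

    # Serial comma: the final '&' must not be preceded by a comma.
    if re.search(r",\s*&", label):
        problems.append("no comma before '&'")

    # Capitalization: every word except articles starts with a capital,
    # and multi-letter words must not be written entirely in capitals.
    for word in WORD.findall(label):
        if word.lower() in ARTICLES:
            continue
        if not word[0].isupper() or (len(word) > 1 and word.isupper()):
            problems.append(f"'{word}' is not in title case")

    return problems
```

For example, `check_term_label("Education, Learning, & Employment")` reports the comma before '&', while `check_term_label("Education, Learning & Employment")` returns an empty list.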
Roles in Two Taxonomy Governance Teams
Executive Sponsor
 Advocate for the taxonomy team
Business Lead
 Keeps team on track with larger business objectives
 Balances cost/benefit issues to decide appropriate levels of effort
   Specialists help in estimating costs
 Obtains needed resources if those in team can’t accomplish a particular task
Technical Specialist
 Estimates costs of proposed changes in terms of amount of data to be retagged, additional storage and processing burden, software changes, etc.
 Helps obtain data from various systems
Content Specialist
 Team’s liaison to content creators
 Estimates costs of proposed changes in terms of editorial process changes, additional or reduced workload, etc.
 Small-scale Metadata QA Responsibility
Taxonomy Specialist
 Suggests potential taxonomy changes based on analysis of query logs, indexer feedback
 Makes edits to taxonomy, installs into system with aid of IT specialist
Content Owner
 Reality check on process change suggestions
Team structure at a different org.
Business Lead
Custodians
 Responsible for content in a specific CV.
Training Representative
 Develops communications plan, training materials
Work Practices Representative
 Develops processes, monitors adherence
IT Representative
 Backups, admin of CV Tool
Info. Mgmt. Representative
 Provides CV expertise, tie-in with larger IM effort in the organization.
Taxonomy governance | Where changes come from
[Diagram labels: Firewall; Application UI; Application Logic; Tagging UI; Tagging Logic; Content; Taxonomy; End User; Tagging Staff; Taxonomy Editor; Taxonomy Team]
Staff notes ‘missing’ concepts
Query log analysis
Requests from other parts of the organization
Recommendations by Editor
1. Small taxonomy changes (labels, synonyms)
2. Large taxonomy changes (retagging, application changes)
3. New “best bets” content
Team considerations
1. Business goals
2. Changes in user experience
3. Retagging cost
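Query log analysis, one of the change sources named on this slide, can be approximated with very little code. The following is a minimal sketch under stated assumptions: the function name `missing_concepts`, the shape of the `vocabulary` mapping (preferred label to list of synonyms), and the `min_count` threshold are all hypothetical, and matching is a plain case-insensitive comparison rather than anything a real search engine would do.

```python
from collections import Counter


def missing_concepts(queries, vocabulary, min_count=2):
    """Return (query, count) pairs for frequent queries that match no term.

    `vocabulary` maps each preferred term label to a list of its synonyms.
    Matching is a case-insensitive exact comparison, a deliberate
    simplification for this sketch.
    """
    # Collect every label and synonym the taxonomy already knows about.
    known = set()
    for label, synonyms in vocabulary.items():
        known.add(label.lower())
        known.update(s.lower() for s in synonyms)

    # Count normalized queries, ignoring blank ones.
    counts = Counter(q.strip().lower() for q in queries if q.strip())

    # Keep frequent queries with no matching label or synonym.
    return [(q, n) for q, n in counts.most_common()
            if n >= min_count and q not in known]


if __name__ == "__main__":
    log = ["401k", "401K", "vacation policy", "pto", "PTO ", "pto", "benefits"]
    vocab = {"Vacation Policy": ["pto"], "Benefits": []}
    print(missing_concepts(log, vocab))
```

Queries that surface this way are candidates for the editor's "small taxonomy changes" (a new synonym) or, if they point at a missing concept, a larger change for the team to weigh.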
Principles
 Basic facets with identified items – people, places,
projects, instruments, missions, organizations, … Note
that these are not subjective “subjects”, they are objective
“objects”.
 Clearly identify the Custodians of the facets, and the
process for maintaining and publishing them.
 Subjective views can be laid on top of the objective facts,
but should be in a different namespace so they are clearly
distinguishable.
 For example, labels like “Anarchist” or “Prime Minister” can be
applied to the same person at different times (e.g. Nelson
Mandela).
Enterprise Portal challenges when organizing content
 Multiple subject domains across the enterprise
 Vocabularies vary
 Granularity varies
 Unstructured information represents about 80%
 Information is stored in complex ways
 Multiple physical locations
 Many different formats
 Tagging is time-consuming and requires SME involvement
 Portal doesn’t solve content access problem
 Knowledge is power syndrome
 Incentives to share knowledge don’t exist
 Free flow of information TO the portal might be inhibited
 Content silo mentality changes slowly
 What content has changed?
 What exists?
 What has been discontinued?
 Lack of awareness of other initiatives
Challenges when organizing content on enterprise portals
 Lack of content standardization and consistency
 Content messages vary among departments
 How do users know which message is correct?
 Re-usability low to non-existent
 Costs of content creation, management and delivery may not change when portal is implemented:
   Similar subjects, BUT
   Diverse media
   Diverse tools
   Different users
 How will personalization be implemented?
 How will existing site taxonomies be leveraged?
 Taxonomy creation may surface “holes” in content
Methods used to create & maintain metadata
[Bar chart, vertical axis 0% to 80%. Categories: Centralized production, Not Automated, Forms, Distributed Production. Bar values: 71%, 57%, 43%, 43% (order as extracted).]
Base: 20 corporate information managers
CEN/ISSS Workshop on Dublin Core – Guidance information for the deployment of Dublin Core metadata in Corporate Environments
The Tagging Problem
 How are we going to populate metadata elements with
complete and consistent values?
 What can we expect to get from automatic classifiers?
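"Complete and consistent" can be made concrete with a small audit over tagged records. The sketch below is illustrative only: `title`, `creator`, `date`, and `subject` are Dublin Core element names, but which of them are required, the sample controlled vocabulary, and the function name `audit_record` are assumptions made for this example.

```python
# Assumed policy: elements every record must carry (hypothetical choice).
REQUIRED_ELEMENTS = ("title", "creator", "date", "subject")

# Hypothetical controlled vocabulary, keyed by metadata element.
CONTROLLED_VALUES = {"subject": {"Benefits", "Payroll", "Travel"}}


def audit_record(record):
    """Return (missing_elements, invalid_values) for one metadata record.

    Completeness: every required element has a non-empty value.
    Consistency: values of controlled elements come from the vocabulary.
    """
    missing = [e for e in REQUIRED_ELEMENTS if not record.get(e)]

    invalid = []
    for element, allowed in CONTROLLED_VALUES.items():
        values = record.get(element) or []
        if isinstance(values, str):
            values = [values]  # accept a single value as well as a list
        invalid.extend((element, v) for v in values if v not in allowed)

    return missing, invalid


if __name__ == "__main__":
    record = {"title": "Travel policy", "creator": "HR",
              "subject": ["Travel", "Trips"]}
    print(audit_record(record))
```

Running an audit like this over a repository gives a baseline for how complete and consistent the existing metadata is before deciding how much tagging effort to add.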
Tagging
 Province of authors (SMEs) or editors?
 Taxonomy often highly granular to meet task and re-use needs.
 Vocabulary dependent on originating department.
 The more tags there are (and the more values for each tag), the more hooks to the content.
 If there are too many, authors will resist and use “general” tags (if available).
 Automatic classification tools exist, and are valuable, but results are not as good as humans can do.
 “Semi-automated” is best.
 Degree of human involvement is a cost/benefit tradeoff.
Automatic categorization vendors | Analyst viewpoint
[Chart axes: Content Volumes (low to high), Accuracy Level (low to high)]
Considerations in automatic classifier performance
[Chart: Accuracy plotted against Development Effort/Licensing Expense. Labels: Regexps; Trained Librarians; potential performance gain.]
 Classification performance is measured by “inter-cataloger agreement”
 Trained librarians agree less than 80% of the time
 Errors are subtle differences in judgment, or big goofs
 Automatic classification struggles to match human performance
 Exception: Entity recognition can exceed human performance
 Classifier performance limited by algorithms available, which is limited by development effort
 Very wide variance in one vendor’s performance depending on who does the implementation, and how much time they have to do it
1) 80/20 tradeoff where 20% of effort gives 80% of performance.
2) Smart implementation of inexpensive tools will outperform naive implementations of world-class tools.
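Inter-cataloger agreement can be computed directly when two taggers (two librarians, or a librarian and a classifier) label the same documents. The sketch below is a hedged illustration: the function names are hypothetical, each document is assumed to carry exactly one category per tagger, and Cohen's kappa is added here as a common chance-corrected statistic that the slide itself does not mention.

```python
from collections import Counter


def percent_agreement(tags_a, tags_b):
    """Fraction of documents on which two taggers chose the same category."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def cohens_kappa(tags_a, tags_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(tags_a)
    observed = percent_agreement(tags_a, tags_b)
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    # Chance agreement: both taggers independently pick the same category.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both taggers used a single identical category
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    librarian = ["HR", "HR", "Finance", "Legal", "HR"]
    classifier = ["HR", "Finance", "Finance", "Legal", "HR"]
    print(percent_agreement(librarian, classifier))
    print(cohens_kappa(librarian, classifier))
```

Measuring a classifier against the agreement level that trained librarians reach with each other, rather than against 100%, gives a more realistic target.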
Tagging tool example: Interwoven MetaTagger
Manual form fill-in w/ check boxes, pull-down lists, etc.
Auto keyword & summarization
Tagging tool example: Interwoven MetaTagger
Auto-categorization
Rules & pattern matching
Parse & lookup (recognize names)
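The two simplest techniques named here, rules with pattern matching and parse-and-lookup of known names, can be sketched generically. This is not MetaTagger's API or behavior: the rule table, the name list, and the function `suggest_tags` are all hypothetical, and the output uses the Dublin Core element names `type` and `subject` only as an example.

```python
import re

# Hypothetical rules: regular expression -> (metadata element, value).
PATTERN_RULES = [
    (re.compile(r"\binvoice\b", re.IGNORECASE), ("type", "Invoices")),
    (re.compile(r"\b(manual|handbook)\b", re.IGNORECASE),
     ("type", "Manuals & Forms")),
]

# Hypothetical lookup list: lower-cased known name -> preferred label.
KNOWN_NAMES = {
    "interwoven": "Interwoven",
    "dublin core": "Dublin Core",
}


def suggest_tags(text):
    """Suggest metadata values for a text using rules and name lookup."""
    tags = {}

    # Rules & pattern matching: any hit on a pattern assigns its value.
    for pattern, (element, value) in PATTERN_RULES:
        if pattern.search(text):
            tags.setdefault(element, set()).add(value)

    # Parse & lookup: recognize known names as whole words or phrases.
    lowered = text.lower()
    for name, label in KNOWN_NAMES.items():
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            tags.setdefault("subject", set()).add(label)

    return tags


if __name__ == "__main__":
    print(suggest_tags("Employee handbook for Dublin Core tagging"))
```

Suggestions produced this way are best treated as defaults that a person can approve or edit.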
Metadata tagging workflows
 Even ‘purely’ automatic meta-tagging systems need a manual error correction procedure.
 Should add a QA sampling mechanism
 Tagging models:
   Author-generated
   Central librarians
   Hybrid – central auto-tagging service, distributed manual review and correction
[Workflow diagram: sample of ‘author-generated’ metadata workflow. Boxes: Compose in Template; Automatically fill-in metadata; Submit to CMS; Review content; Approve/Edit metadata; Copy Edit content; Hard Copy; Web site. Decision points: Problem? (Y/N), shown twice. Roles: Tagging Tool, Analyst, Editor, Copywriter, Sys Admin.]
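A QA sampling mechanism can be as simple as drawing a random subset of newly tagged items for manual review. The sketch below is an assumption-laden illustration: the function name `qa_sample`, the default 5% rate, and the rule of always reviewing at least one item are choices made for the example, not recommendations from the workshop.

```python
import random


def qa_sample(records, rate=0.05, seed=None):
    """Pick a random subset of tagged records for manual review.

    `rate` is the fraction to review; at least one record is returned
    whenever any records exist. `seed` makes the draw repeatable.
    """
    if not 0 < rate <= 1:
        raise ValueError("rate must be greater than 0 and at most 1")
    records = list(records)
    if not records:
        return []
    rng = random.Random(seed)
    size = max(1, round(len(records) * rate))
    return rng.sample(records, size)


if __name__ == "__main__":
    batch = [f"doc-{i}" for i in range(200)]
    for doc_id in qa_sample(batch, rate=0.05, seed=1):
        print("review metadata for", doc_id)
```

The share of sampled records that reviewers have to correct gives a running estimate of tagging quality, which can feed the decision on how much human involvement is worth paying for.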
Automatic categorization vendors | Pragmatic viewpoint
[Chart axes: Content Volumes (low to high), Accuracy Level (low to high)]
Seven practical rules for taxonomies
1. Incremental, extensible process that identifies and enables users, and engages stakeholders.
2. Quick implementation that provides measurable results as quickly as possible.
3. Not monolithic: has separately maintainable facets.
4. Re-uses existing IP as much as possible.
5. A means to an end, and not the end in itself.
6. Not perfect, but it does the job it is supposed to do, such as improving search and navigation.
7. Improved over time, and maintained.
Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Summary, Q&A
6:45 Adjourn
Summary: Categorize with a purpose
 What is the problem you are trying to solve?
   Improve search
   Browse for content on an enterprise-wide portal
   Enable business users to syndicate content
   Otherwise provide the basis for content re-use
 How will you control the cost of creating and maintaining the metadata needed to solve these problems?
   CMS with metadata tagging products
   Semi-automated classification
   Taxonomy editing tools
   Guided navigation tools
Taxonomy Strategies LLC
Contact Info
Ron Daniel
925-368-8371
rdaniel@taxonomystrategies.com
Joseph Busch
415-377-7912
jbusch@taxonomystrategies.com
May 22, 2005
Copyright 2005 Taxonomy Strategies LLC. All rights reserved.