ISO 25964 - Eurovoc thesaurus

ISO 25964 - the new standard
for thesauri and interoperability
with other vocabularies
Stella G Dextre Clarke
Project Leader, ISO NP 25964
Overview





What is ISO 25964?
Outline of Part 1
Outline of Part 2
More detail on some of the issues dealt
with in the standard
Comment on the need for a standard
What is ISO 25964?
ISO 25964: Thesauri and interoperability with other
vocabularies
 Part 1: Thesauri for information retrieval
 Part 2: Interoperability with other vocabularies
 It updates ISO 2788 and ISO 5964, with some input
from BS 8723
 Information retrieval (indexing/searching) is the
overall context
 Part 1 covers monolingual and multilingual thesauri
(= ISO 2788 + ISO 5964)
 Part 2 covers mapping between thesauri and other
types of vocabulary
What distinguishes ISO 25964-1
from ISO 2788/5964?







Clearer differentiation between terms and concepts
Clearer guidance on applying facet analysis to
thesauri
Some changes to the ‘rules’ for compound terms
More guidance on managing thesaurus development
and maintenance
Requirements for software to manage thesauri
Data model and XML schema for data exchange
General overhaul in all areas, e.g. sweeping update
of multilingual examples
Is there a need for ISO 25964-1?











“The thesaurus is dead. Long live Google!”
But look how many thesauri we see today – alive and growing
“Nobody has time to do indexing nowadays”
Did anyone ever follow ISO 2788 rigorously?
Look at the lack of standardization in today’s thesauri. The ideal
thesaurus responds to the special needs of its own users.
Consider the demand for networked applications which draw upon
multiple heterogeneous resources
Consider the diversity and evolution of languages/terminology in
today’s full text
Don’t forget the challenge of searching for images without text
Successful automated networking depends on standards, or at least
predictability in the tools and resources
ISO 25964-1 compliance should enhance predictability in search tools
And ISO 25964-2?
Content of ISO 25964-2
“Interoperability with other vocabularies”







No normative statements about building
vocabularies other than thesauri
However, comparisons are made and key
features described.
Emphasis is on interoperability, especially
mapping between different vocabularies
Structural models for mapping
Recommended mapping types
How to handle pre-coordination
Practical aspects of mapping
Which “other vocabularies”?








Classification schemes
Business classification schemes for records
management (aka file plans)
Taxonomies
Subject heading schemes
Ontologies
Terminologies/Term banks
Name authority lists
Synonym rings
Structural models for mapping
across vocabularies
[Figure: structural models for mapping across vocabularies; diagram with nodes labelled A to H and P to S]
The dangers of chain mapping
buses → coaches
coaches → trainers
trainers → training shoes
timber → wood
wood → woods
woods → forests
job vacancies → jobs
jobs → posts
posts → post
post → mail
firewood → logs
logs → records
records → archives
Any one of the mappings could be OK in one context, but not
when chained.
Most howlers can be avoided, but only if you check carefully
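
The chaining problem can be sketched in a few lines of Python. This is only an illustration: the mapping table copies two of the chains from the slide, and the function name `follow_chain` and the `max_hops` limit are hypothetical, not anything defined by ISO 25964.

```python
# One-step mappings copied from the slide's examples. Each step may be
# acceptable in some context; the dictionary form is an assumption.
MAPPINGS = {
    "buses": "coaches",
    "coaches": "trainers",
    "trainers": "training shoes",
    "timber": "wood",
    "wood": "woods",
    "woods": "forests",
}


def follow_chain(term, mappings, max_hops=10):
    """Return every term reached by repeatedly following mappings from `term`."""
    chain = [term]
    seen = {term}
    while chain[-1] in mappings and len(chain) <= max_hops:
        nxt = mappings[chain[-1]]
        if nxt in seen:  # stop if the mappings loop back on themselves
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


# Blindly chaining turns a search for buses into a search for footwear.
print(" -> ".join(follow_chain("buses", MAPPINGS)))
print(" -> ".join(follow_chain("timber", MAPPINGS)))
```

Listing the full chain for each starting term is one way to make the careful checking feasible: a reviewer can scan each chain end to end instead of approving the steps one at a time.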
The dangers of two-way mappings
[Figure: mappings among three vocabularies (Vocabulary 1, Vocabulary 2, Vocabulary 3) involving the terms Poultry, Chickens, Ducks, Geese, Birds, Parrots, Canaries and Budgies]
ISO 25964-2 mapping types

Basic mapping types:
Equivalence
Hierarchical
Associative

equivalence mappings can also be
marked as “Exact” or “Inexact”
ISO 25964-2 mapping types
with examples

Basic mapping types:
Equivalence: Laptop computers EQ Notebook computers
Hierarchical: Roads NM Streets; Streets BM Roads
Associative: e-Learning RM Distance education

“Exact” or “Inexact” equivalence
Aubergines =EQ Egg-plants
Horticulture ~EQ Gardening
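
One possible way to hold such mappings in software is sketched below. The tags EQ, BM, NM and RM and the "=" / "~" prefixes come from the slide; the class names `MappingType` and `Mapping` and their fields are assumptions made for illustration, not the data model of the standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MappingType(Enum):
    EQ = "equivalence"
    BM = "broader (hierarchical)"
    NM = "narrower (hierarchical)"
    RM = "associative"


@dataclass(frozen=True)
class Mapping:
    source: str
    kind: MappingType
    target: str
    # True = exact, False = inexact, None = not stated.
    # Only used for equivalence mappings.
    exact: Optional[bool] = None

    def notation(self) -> str:
        """Render the mapping in the notation used on the slide."""
        prefix = ""
        if self.kind is MappingType.EQ and self.exact is not None:
            prefix = "=" if self.exact else "~"
        return f"{self.source} {prefix}{self.kind.name} {self.target}"


examples = [
    Mapping("Laptop computers", MappingType.EQ, "Notebook computers"),
    Mapping("Roads", MappingType.NM, "Streets"),
    Mapping("e-Learning", MappingType.RM, "Distance education"),
    Mapping("Aubergines", MappingType.EQ, "Egg-plants", exact=True),
    Mapping("Horticulture", MappingType.EQ, "Gardening", exact=False),
]
for m in examples:
    print(m.notation())
```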
Subdivisions of ISO 25964-2
mapping types

Basic mapping types:
Equivalence
Simple
Compound
Intersecting compound equivalence
Cumulative compound equivalence
Hierarchical
Broader
Narrower
Associative

“Exact” or “Inexact” applies to simple but not
compound equivalence
Equivalence subdivisions with
examples


Simple
Laptop computers EQ Notebook computers
Compound
Intersecting compound equivalence
Women executives EQ Women + Executives
Cumulative compound equivalence
Inland waterways EQ rivers | canals
Intersecting versus cumulative
equivalence
Women executives EQ Women + Executives
Inland waterways EQ rivers | canals
[Figure: "women executives" shown as the overlap of "women" and "executives"; "inland waterways" shown as the combination of "rivers" and "canals"]
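
At search time the difference can be read as AND versus OR over the target vocabulary. The sketch below assumes a toy index of document numbers per concept; the index contents and the function names are hypothetical and serve only to illustrate the two cases.

```python
# Hypothetical index: documents tagged with each target-vocabulary concept.
INDEX = {
    "women": {1, 2, 3},
    "executives": {2, 3, 4},
    "rivers": {5, 6},
    "canals": {6, 7},
}


def intersecting(index, *concepts):
    """A + B: documents indexed with ALL of the target concepts."""
    return set.intersection(*(index[c] for c in concepts))


def cumulative(index, *concepts):
    """A | B: documents indexed with ANY of the target concepts."""
    return set.union(*(index[c] for c in concepts))


# Women executives EQ Women + Executives
print(sorted(intersecting(INDEX, "women", "executives")))  # [2, 3]

# Inland waterways EQ rivers | canals
print(sorted(cumulative(INDEX, "rivers", "canals")))  # [5, 6, 7]
```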
Pre-coordination adds
complexity


If only we could ignore classification
schemes and subject heading schemes!
For example:

The UDC class
373.3.016:51
(mathematics curriculum in primary schools)

The LCSH heading
Automobiles--Air conditioning--Maintenance and repair--Periodicals
Example: “academic library labor
unions in Germany”
(from Marcia Lei Zeng / FRSAD report)
DDC: "331.881102770943"
331.8811 – labor unions in industries and occupations
other than extractive, manufacturing, construction
-027.7 – academic libraries
-0943 – Germany
LCSH:
"Library employees--Labor unions--Germany"
"Universities and colleges--Employees--Labor unions--Germany"
"Collective bargaining--Academic librarians--Germany"
"Libraries and labor unions--Germany"
UNESCO Thesaurus:
“Trade unions” “Academic libraries” “Germany”
ILO Thesaurus:
“Trade union” “library” “educational institution” “Germany”
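
To see what a mapping tool has to take apart, here is a minimal sketch that decomposes the pre-coordinated strings above. It assumes LCSH subdivisions are separated by "--" as in the examples, and it reproduces the DDC number only by concatenating the three parts exactly as the slide breaks them down; it is not an implementation of the real DDC number-building rules. Both function names are hypothetical.

```python
def split_lcsh(heading):
    """Split a pre-coordinated subject heading into its components."""
    return [part.strip() for part in heading.split("--")]


def synthesize_ddc(base, *additions):
    """Concatenate a base number and added notations, as on the slide.

    Assumption: decimal points and leading hyphens are dropped from the
    parts, and a single point is placed after the third digit.
    """
    digits = base.replace(".", "")
    for addition in additions:
        digits += addition.lstrip("-").replace(".", "")
    if len(digits) > 3:
        return digits[:3] + "." + digits[3:]
    return digits


print(split_lcsh("Library employees--Labor unions--Germany"))
print(synthesize_ddc("331.8811", "-027.7", "-0943"))  # 331.881102770943
```

The thesaurus examples need no such step: each descriptor is already a separate concept that is combined with the others only at search time.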
How to map to and from pre-coordinated
classes and synthesized notations?






For vocabularies using post-coordination (especially thesauri),
mappings between them look feasible
Mapping from a pre-coordinated or synthesized class to a
thesaurus looks feasible.
Mapping to a pre-coordinated class looks more problematic!
The same applies to mapping from a synthesized class in one
scheme to a differently synthesized class in another scheme
In subject heading schemes and classification schemes, pre-coordination works in slightly different ways. Can we find
common solutions?
In any case, should the aim be to map between schemes, or
between the indexes of collections indexed/catalogued with the
schemes?
In the real world, mapping
perfection is elusive…









Mapping projects are labour-intensive and often under-resourced
Exact equivalence is all too rare
Even when exact equivalence seems likely, it is often hard to be
sure
Some managers assume that mappings can be found by
computers without human guidance
Often the vocabularies to be mapped are poorly constructed
Compound equivalence is commonly needed, but often
unavailable
Inclusion of pre-coordinate schemes makes it much harder
Some systems allow only one mapping per concept
While preparing mappings, you can’t make assumptions about
capabilities of the search software
Is there a need for ISO 25964-2?





Consider the demand for networked applications
which draw upon multiple heterogeneous resources
Finding equivalent concepts cannot rely on
comparison of text words alone
Bear in mind the challenges listed above
Practical experience of mapping is not widespread
ISO 25964-2 provides guidance on good practice,
mostly on the intellectual processes but also on the
potential for automation
Want a copy of ISO 25964-2?




A draft is due to appear in early 2011, “ISO
DIS 25964-2”, with the hope of attracting
comments from potential users
The official way to get it is through your
national standards body (e.g. BSI, DIN)
Distribution policies vary from one country to
another; last time round we found a way to
make the draft available online free of charge
and free of passwords, on the BSI site.
Send me an email and I’ll alert you when the
DIS is released. stella@lukehouse.org
Want to get involved?



Contact your national standards body,
specifically the committee corresponding to
ISO TC 46/SC 9/WG8
17 countries already participate: Belgium,
Bulgaria, Canada, China, Denmark, France,
Germany, Finland, Korea, New Zealand, Russia,
South Africa, Spain, Sweden, UK, Ukraine, USA
While Part 1 of the standard will be published in
2011, Part 2 is still in draft. There is time for you
to contribute ideas on interoperability!