ISO 25964 - the new standard for thesauri and interoperability with other vocabularies
Stella G Dextre Clarke
Project Leader, ISO NP 25964

Overview
  What is ISO 25964?
  Outline of Part 1
  Outline of Part 2
  More detail on some of the issues dealt with in the standard
  Comment on the need for a standard

What is ISO 25964?
  ISO 25964: Thesauri and interoperability with other vocabularies
    Part 1: Thesauri for information retrieval
    Part 2: Interoperability with other vocabularies
  It updates ISO 2788 and ISO 5964, with some input from BS 8723
  Information retrieval (indexing/searching) is the overall context
  Part 1 covers monolingual and multilingual thesauri (= ISO 2788 + ISO 5964)
  Part 2 covers mapping between thesauri and other types of vocabulary

What distinguishes ISO 25964-1 from ISO 2788/5964?
  Clearer differentiation between terms and concepts
  Clearer guidance on applying facet analysis to thesauri
  Some changes to the ‘rules’ for compound terms
  More guidance on managing thesaurus development and maintenance
  Requirements for software to manage thesauri
  Data model and XML schema for data exchange
  General overhaul in all areas, e.g. sweeping update of multilingual examples

Is there a need for ISO 25964-1?
  “The thesaurus is dead. Long live Google!”
  But look how many thesauri we see today – alive and growing
  “Nobody has time to do indexing nowadays”
  Did anyone ever follow ISO 2788 rigorously? Look at the lack of standardization in today’s thesauri.
  The ideal thesaurus responds to the special needs of its own users.
  Consider the demand for networked applications which draw upon multiple heterogeneous resources
  Consider the diversity and evolution of languages/terminology in today’s full text
  Don’t forget the challenge of searching for images without text
  Successful automated networking depends on standards, or at least predictability in the tools and resources
  ISO 25964-1 compliance should enhance predictability in search tools

And ISO 25964-2?
Content of ISO 25964-2 “Interoperability with other vocabularies”
  No normative statements about building vocabularies other than thesauri
  However, comparisons are made and key features described.
  Emphasis is on interoperability, especially mapping between different vocabularies
    Structural models for mapping
    Recommended mapping types
    How to handle pre-coordination
    Practical aspects of mapping

Which “other vocabularies”?
  Classification schemes
  Business classification schemes for records management (aka file plans)
  Taxonomies
  Subject heading schemes
  Ontologies
  Terminologies/Term banks
  Name authority lists
  Synonym rings

Structural models for mapping across vocabularies
  [Figure: structural models for mapping; vocabularies labelled A to H and P to S]

The dangers of chain mapping
  buses → coaches; coaches → trainers; trainers → training shoes
  timber → wood; wood → woods; woods → forests
  job vacancies → jobs; jobs → posts; posts → post; post → mail
  firewood → logs; logs → records; records → archives
  Any one of the mappings could be OK in one context, but not when chained.
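The chains listed above can be reproduced mechanically, which is one way to flag candidates for human checking. The following is a minimal Python sketch, not something taken from the standard: the dictionary `MAPPINGS` and the helper names `follow_chain` and `chains_needing_review` are hypothetical, and each term is assumed to have at most one onward mapping.

```python
# One-step mappings copied from the chain-mapping examples above.
# Assumption for this sketch: each term has at most one onward mapping.
MAPPINGS = {
    "buses": "coaches",
    "coaches": "trainers",
    "trainers": "training shoes",
    "timber": "wood",
    "wood": "woods",
    "woods": "forests",
    "job vacancies": "jobs",
    "jobs": "posts",
    "posts": "post",
    "post": "mail",
    "firewood": "logs",
    "logs": "records",
    "records": "archives",
}


def follow_chain(term, mappings):
    """Follow one-step mappings from `term` until none is left."""
    chain = [term]
    seen = {term}
    while chain[-1] in mappings:
        nxt = mappings[chain[-1]]
        if nxt in seen:  # stop if the mappings loop back on themselves
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


def chains_needing_review(mappings):
    """Return every chain of two or more hops, starting from terms
    that are not themselves the target of another mapping."""
    targets = set(mappings.values())
    starts = [term for term in mappings if term not in targets]
    chains = [follow_chain(start, mappings) for start in starts]
    return [chain for chain in chains if len(chain) > 2]


if __name__ == "__main__":
    for chain in chains_needing_review(MAPPINGS):
        print(" → ".join(chain))
```

The sketch only lists the chains; deciding whether the first and last terms still mean the same thing remains a human judgement.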
  Most howlers can be avoided, but only if you check carefully

The dangers of two-way mappings
  [Figure: terms spread across Vocabulary 1, Vocabulary 2 and Vocabulary 3: Poultry, Parrots, Chickens, Canaries, Birds, Ducks, Budgies, Geese]

ISO 25964-2 mapping types
  Basic mapping types:
    Equivalence
    Hierarchical
    Associative
  Equivalence mappings can also be marked as “Exact” or “Inexact”

ISO 25964-2 mapping types with examples
  Basic mapping types:
    Equivalence: Laptop computers EQ Notebook computers
    Hierarchical: Roads NM Streets; Streets BM Roads
    Associative: e-Learning RM Distance education
  “Exact” or “Inexact” equivalence:
    Aubergines =EQ Egg-plants
    Horticulture ~EQ Gardening

Subdivisions of ISO 25964-2 mapping types
  Basic mapping types:
    Equivalence
      Simple
      Compound
        Intersecting compound equivalence
        Cumulative compound equivalence
    Hierarchical
      Broader
      Narrower
    Associative
  “Exact” or “Inexact” applies to simple but not compound equivalence

Equivalence subdivisions with examples
  Simple
    Laptop computers EQ Notebook computers
  Compound
    Intersecting compound equivalence: Women executives EQ Women + Executives
    Cumulative compound equivalence: Inland waterways EQ rivers | canals

Intersecting versus cumulative equivalence
  Women executives EQ Women + Executives
  Inland waterways EQ rivers | canals
  [Figure: diagrams contrasting the two cases; labels: women executives, rivers, canals, inland waterways]

Pre-coordination adds complexity
  If only we could ignore classification schemes and subject heading schemes!
  For example:
    The UDC class 373.3.016:51 (mathematics curriculum in primary schools)
    The LCSH heading Automobiles--Air conditioning--Maintenance and repair--Periodicals

Example: “academic library labor unions in Germany” (from Marcia Lei Zeng/FRSAD report)
  DDC: "331.881102770943"
    331.8811 – labor unions in industries and occupations other than extractive, manufacturing, construction
    -027.7 – academic libraries
    -0943 – Germany
  LCSH:
    "Library employees--Labor unions--Germany"
    "Universities and colleges--Employees--Labor unions--Germany"
    "Collective bargaining--Academic librarians--Germany"
    "Libraries and labor unions--Germany"
  UNESCO Thesaurus: “Trade unions” “Academic libraries” “Germany”
  ILO Thesaurus: “Trade union” “library” “educational institution” “Germany”

How to map to and from pre-coordinated classes and synthesized notations?
  For vocabularies using post-coordination (esp. thesauri), mappings between them look feasible
  Mapping from a pre-coordinated or synthesized class to a thesaurus looks feasible.
  Mapping to a pre-coordinated class looks more problematic!
  The same applies to mapping from a synthesized class in one scheme to a differently synthesized class in another scheme
  Pre-coordination works in slightly different ways in subject heading schemes and in classification schemes. Can we find common solutions?
  In any case, should the aim be to map between schemes, or between the indexes of collections indexed/catalogued with the schemes?
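The mapping types introduced earlier, together with the pre-coordinated example above, could be recorded in a structure along the following lines. This is a minimal Python sketch under stated assumptions, not the data model of ISO 25964: the names `MappingKind`, `Combination`, `Mapping` and `render` are hypothetical, and the final DDC-to-UNESCO mapping is an illustration only, not a vetted mapping.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class MappingKind(Enum):
    EQ = "equivalence"
    BM = "broader mapping"
    NM = "narrower mapping"
    RM = "related mapping"


class Combination(Enum):
    SINGLE = ""        # one target concept
    INTERSECTING = "+"  # all targets combined, e.g. Women + Executives
    CUMULATIVE = "|"    # targets taken together as alternatives, e.g. rivers | canals


@dataclass(frozen=True)
class Mapping:
    source: str
    kind: MappingKind
    targets: Tuple[str, ...]
    combination: Combination = Combination.SINGLE
    exact: Optional[bool] = None  # True "=", False "~", None unmarked

    def __post_init__(self):
        if not self.targets:
            raise ValueError("a mapping needs at least one target")
        compound = len(self.targets) > 1
        if compound != (self.combination is not Combination.SINGLE):
            raise ValueError("compound mappings need '+' or '|', simple ones neither")
        # Following the slide outline: compound mappings are a kind of equivalence.
        if compound and self.kind is not MappingKind.EQ:
            raise ValueError("compound mappings are modelled here as equivalence only")
        # "Exact"/"Inexact" applies to simple but not compound equivalence.
        if self.exact is not None and (compound or self.kind is not MappingKind.EQ):
            raise ValueError("exact/inexact applies only to simple equivalence")

    def render(self) -> str:
        """Render in the notation used on the slides."""
        marker = {True: "=", False: "~", None: ""}[self.exact]
        joined = f" {self.combination.value} ".join(self.targets)
        return f"{self.source} {marker}{self.kind.name} {joined}"


examples = [
    Mapping("Aubergines", MappingKind.EQ, ("Egg-plants",), exact=True),
    Mapping("Horticulture", MappingKind.EQ, ("Gardening",), exact=False),
    Mapping("Roads", MappingKind.NM, ("Streets",)),
    Mapping("Women executives", MappingKind.EQ, ("Women", "Executives"),
            Combination.INTERSECTING),
    Mapping("Inland waterways", MappingKind.EQ, ("rivers", "canals"),
            Combination.CUMULATIVE),
    # Illustration only: a synthesized DDC class mapped to a combination
    # of UNESCO Thesaurus concepts.
    Mapping("331.881102770943", MappingKind.EQ,
            ("Trade unions", "Academic libraries", "Germany"),
            Combination.INTERSECTING),
]

if __name__ == "__main__":
    for mapping in examples:
        print(mapping.render())
```

The sketch covers only the direction the slides describe as feasible, from a pre-coordinated class to a combination of thesaurus concepts. Mapping in the opposite direction would need a way to name a class that may not yet exist in the target scheme, which this structure does not attempt.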
In the real world, mapping perfection is elusive…
  Mapping projects are labour intensive, and often under-resourced
  Exact equivalence is all too rare
  Even when exact equivalence seems likely, it is often hard to be sure
  Some managers assume that mappings can be found by computers without human guidance
  Often the vocabularies to be mapped are poorly constructed
  Compound equivalence is commonly needed, but often unavailable
  Inclusion of pre-coordinate schemes makes it much harder
  Some systems allow only one mapping per concept
  While preparing mappings, you can’t make assumptions about capabilities of the search software

Is there a need for ISO 25964-2?
  Consider the demand for networked applications which draw upon multiple heterogeneous resources
  Finding equivalent concepts cannot rely on comparison of text words alone
  Bear in mind the challenges listed above
  Practical experience of mapping is not widespread
  ISO 25964-2 provides guidance on good practice, mostly on the intellectual processes but also on the potential for automation

Want a copy of ISO 25964-2?
  A draft is due to appear in early 2011, “ISO DIS 25964-2”, with the hope of attracting comments from potential users
  The official way to get it is through your national standards body (e.g. BSI, DIN)
  Distribution policies vary from one country to another; last time round we found a way to make the draft available online free of charge and free of passwords, on the BSI site.
  Send me an email and I’ll alert you when the DIS is released: stella@lukehouse.org

Want to get involved?
  Contact your national standards body, specifically the committee corresponding to ISO TC 46/SC 9/WG8
  17 countries already participate: Belgium, Bulgaria, Canada, China, Denmark, France, Germany, Finland, Korea, New Zealand, Russia, South Africa, Spain, Sweden, UK, Ukraine, USA
  While Part 1 of the standard will be published in 2011, Part 2 is still in draft. There is time for you to contribute ideas on interoperability!