Emerging Standards for Libraries and Publishers
Cliff Morgan, John Wiley & Sons Ltd
UKSG briefing session, 15-17 April 2002

What I’ll be covering
  Identifiers
  Metadata
  E-books

What I won’t be covering
  Graphics (e.g. JPEG, GIF, PNG, SVG)
  Character sets (ASCII, Unicode)
  Relationship models (RDF, Topic Maps/XTM)
  E-commerce (UN/EDIFACT, XML-edi, ebXML)
  XML stuff (Schemas, XLink, XSL, XSLT, etc.)
  Usage stats standards (e.g. COUNTER, ANSI/NISO Z39.7-1995)
  Rights metadata (XrML, ODRL)

Identifiers
  ISSN
  ISBN
  SICI
  BICI
  PII
  DOI
  ISTC
  Multimedia identifiers

ISBN
  International Standard Book Number
  ISO 2108
  e.g. 0-471-92755-4
  Geog location/language - publisher/imprint - title (print format) - check character
  Has been a standard for > 30 years

New ISBN
  ISBN is being revised - 13 digits from 1/1/05
  Can double capacity by giving a 979 prefix
  Issues:
  - hexadecimal or decimal?
  - limit ISBN to print - do something else for electronic? versions? formats?
  - assign to components (e.g. chaps)?
  - should number be completely dumb?
  - metadata deposit at assignment?

ISSN
  International Standard Serial Number
  ISO 3297
  e.g. 0749-503X
  If the publisher has not applied for an ISSN, any 3rd party can apply for one for their own data management needs
  Different media get different ISSNs, e.g. print ISSN is different from CD-ROM ISSN
  But different file formats don’t get different ISSNs, so offline is different from online, but PDF is the same as HTML
  If online contains only abstracts of print full text, no new ISSN for the e-version
  If using both print and e-ISSNs, must change both if the title changes
  http://www.issn.org:8080/English/pub/getting-checking

SICI
  Serial Item and Contribution Identifier
  ANSI/NISO Z39.56-1996 - reaffirmed
  e.g. issue = 0749-503X(20010115)18:1<>1.0.TX;2-X
       article = 0749-503X(20010115)18:1<1:YGPIWG>2.0.TX;2-X
  (Check digits in above examples have not been calculated.)
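Unlike the SICI check characters, the check characters in the ISBN and ISSN examples given earlier can be verified by hand. The sketch below is illustrative only: it uses the modulus-11 weighting that ISBN-10 and ISSN share, and it assumes the 13-digit ISBN would follow the existing EAN-13 pattern of a 978 prefix plus a modulus-10 check digit, which the slides do not spell out. The function names are hypothetical.

```python
def mod11_check(digits: str) -> str:
    """Check character for an ISBN-10 body (9 digits) or ISSN body (7 digits).

    Weights run from len(digits) + 1 down to 2; a remainder of 10 is written 'X'.
    """
    weights = range(len(digits) + 1, 1, -1)
    total = sum(w * int(d) for w, d in zip(weights, digits))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def isbn10_to_13(isbn10: str, prefix: str = "978") -> str:
    """Convert an ISBN-10 to 13 digits (assumption: EAN-13 style check digit)."""
    body = prefix + isbn10.replace("-", "")[:9]  # drop the old check character
    # EAN-13 weights alternate 1, 3, 1, 3, ... from the left
    total = sum((3 if i % 2 else 1) * int(d) for i, d in enumerate(body))
    return body + str((10 - total % 10) % 10)


isbn_check = mod11_check("047192755")   # body of 0-471-92755-4
issn_check = mod11_check("0749503")     # body of 0749-503X
isbn13 = isbn10_to_13("0-471-92755-4")
print(isbn_check, issn_check, isbn13)
```

Both examples from the slides come out with the check characters shown there (4 and X), so they are at least internally consistent.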
  Well used at issue level - bar codes
  Less used at article level

SICIs at Article Level
  Requires publication info - but publishers want to assign article IDs before publication
  Long-winded
  Unfortunate syntax for Internet transfer (<>, #) - needs SGML entifying and hex encoding
  Unclear what to do with special characters in Title Code
  Not a unique ID if two untitled articles are on the same page (e.g. Letters)

C = Contribution, not Component
  SICI allows identification of article, issue ToC, issue Index and article abstract (DPIs of 0, 1, 2, 3 respectively)
  No way of using SICI to identify any other component (such as Figure, Table, Section)
  Not surprising since it’s a canonicalisation nightmare
  http://sunsite.berkeley.edu/SICI/version2.html

BICI
  Book Item and Component Identifier
  ISO DSFTU (Draft Standard for Trial Use)
  e.g. 0387119787(1982)<174:ADTATO>2.2.TX;1-Q
  ISBN, date, location, title, component type, etc.
  Trial was Aug 2000 to Jan 2002 - not much evidence of use
  Many issues the same as for SICI, but also less business push

PII
  Publisher Item Identifier
  Proposed in 1995 by ACS, AIP, APS, IEEE and Elsevier, but never became a standard
  e.g. S0749-503X011234
  Some publishers use it as an internal ID since it doesn’t suffer from any of the SICI problems
  But no registration/maintenance agency

DOI
  Digital Object Identifier
  ANSI/NISO Z39.84-2000
  e.g. issue = 10.1002/yea.v18:1
       article = 10.1002/yea.1234
  Well established in academic journals publishing - esp. ‘cos of CrossRef
  4.2 million DOIs deposited to date
  http://www.doi.org

Some publishing issues regarding DOIs
  What are they assigned to?
  Need for matching URL, so can’t assign to anything you wouldn’t give a URL to
  Individual publishers need to decide their DOI structure
  Doesn’t have to be human-friendly but must be unique, easily generated, and matched with a URL
  Application profiles for different genres

Processes
  Apply to a Registration Agency (IDF, CDI, CrossRef, Enpia, LON) for a Registrant Prefix
  For individual DOIs, batch-generate DOIs and URLs from electronic metadata and send to the RA for deposit
  DOIs never change (even if the journal changes ownership) but matched URLs (or other locators) can

ISTC
  International Standard Textual Work Code
  ISO Committee Draft 21047 - circulated Oct 01, voting finished Jan 02: progressed to Enquiry stage
  http://www.nlc-bnc.ca/iso/tc46sc9/21047.htm
  e.g. 0A9-2002-1223F332-0 (RA+year+WorkID+check)
  A Work (= abstract creation) ID - replaces the ISWC(L)
  Creator-centric - authors may apply to the ISTC Agency directly or via agents or via the publisher
  Requires metadata deposit too
  Publishers therefore need to capture these numbers if they’ve been assigned to Works
  Will authors really bother with this?

A couple of non-text, non-graphic IDs you might want to know about
  ISAN
  ISWC

ISAN
  International Standard Audiovisual Number
  ISO Draft International Standard 15706
  e.g. 153C-7365-B36F-844C-N
  Can be issued to movies, trailers, TV programmes, episodes or series, ads, and multimedia works if the A/V component is significant
  http://www.nlc-bnc.ca/iso/tc46sc9/isan.htm
  Work has also started on a V-ISAN for Versions

ISWC
  International Standard Musical Work Code (used to be ISWC(T))
  ISO 15707
  e.g. T-034524680-1
  Identifies any musical work, including arrangements, movements, medleys, samples
  http://www.iswc.org/iswc/iswc/en/html/home.html

Metadata
  Resource discovery (Dublin Core, OAI-PMH), incl. Linking (CrossRef)
  Product metadata (ONIX and ONIX for Serials)
  Preservation metadata (OAIS)
  I am not going to talk about library-specific sets such as MARC, Z39.50, AACR2, etc.
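To give a rough picture of what resource-discovery metadata looks like in practice, the sketch below builds a simple (unqualified) Dublin Core description of this talk using Python's standard library. The `record` wrapper element and the function name are hypothetical, chosen only for illustration; the element names and the `http://purl.org/dc/elements/1.1/` namespace are those of the 15-element Dublin Core set, and the field values are taken from the title slide.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of simple (unqualified) Dublin Core
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def simple_dc_record(fields):
    """Build an XML element holding unqualified Dublin Core fields.

    `fields` is a list of (element_name, value) pairs; elements may repeat.
    The <record> wrapper is an assumption for this sketch, not part of DC.
    """
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for name, value in fields:
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a simple Dublin Core element: {name}")
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


record = simple_dc_record([
    ("title", "Emerging Standards for Libraries and Publishers"),
    ("creator", "Cliff Morgan"),
    ("date", "2002-04"),
])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Anything richer than these flat name/value pairs (roles, schemes, refinements) is where the qualified set and application profiles come in.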
Dublin Core
  Defined Universal Bibliographic Language for Internet Navigation and Coherent Online Resource Exploration [not really!]
  ANSI/NISO Z39.85
  DC 1.1 (simple, unqualified set of 15 elements)
  Qualified set (DCQ? dcterms?) needed to do anything more than basic - not a standard yet
  DC has been mandated by UK Government (“e-GMS”)
  Application Profiles will deal with defined local extensions via namespace declarations

OAI-PMH
  Open Archives Initiative Protocol for Metadata Harvesting
  Not really an archive in the sense of a repository, more of a political statement and a metadata harvesting protocol
  Came out of the e-print community, but they welcome commercial publishers
  Supported by DLF and CNI
  Uses simple (unqualified) Dublin Core as its metadata
  e.g. <creator>Cliff Morgan</>
  Version 2 of protocol due for release June 2002
  http://www.openarchives.org

CrossRef metadata set
  CrossRef matches the metadata in a citation with the metadata in its Metadata Database (MDDB), which includes the DOI for the resource
  Participating publishers (91 of ‘em) deposit the metadata with DOI into the MDDB
  To date, 3.7M DOIs, covering 5000+ journals
  http://www.crossref.org

New version
  Version 2 much more complicated - full schema is 113 pages long
  In addition to journals, covers books and conference proceedings, at whole title and chapter level
  Some element names are different from CrossRef 1.0

ONIX
  OnLine Information eXchange
  Latest release is 2.0
  Original focus was a message format for books through the trade, but it is fast becoming a universal metadata set for describing publications
  http://www.editeur.org
  ONIX being championed by a number of publishers and online retailers
  Swedish Royal Library using ONIX as an input medium

ONIX for Serials
  Provides rich cataloguing information for agents, librarians, users
  Supports alerting, despatch and library check-in
  Structured, multi-level bibliographic descriptions, including ToCs
  Descriptions for library holdings (direct to OPACs)
  Draft 2 just released this month
  Subscription Package Record provides product catalogue info about subscription packages
  Serial Title Record provides catalogue info about an individual serial
  Serial Item Record provides structured multilevel bibliographic description of serial parts

So is the CrossRef set like the ONIX for Serials set?
  No
  They both include metadata that can be used to describe journals, issues and articles
  But they don’t use the same element names
  CrossRef has mapped to ONIX but not to ONIX for Serials yet - but has said it will support it when released

OpenURL
  NISO Work Item
  Separates metadata for the resource from metadata for the location
  Resolver services (such as SFX, CrossRef) make the context-sensitive link
  Solves the “appropriate copy” problem, where more than one legit copy of an article may be available to a library, e.g. local holding, consortium, aggregator service, mirror site, publisher

OpenURL metadata
  OpenURL comprises BASEURL and QUERY
  BASEURL identifies the resolver; QUERY is a resource description
  e.g.
  (simplified):
  http://resolver.ukoln.ac.uk/genre=article&atitle=Information%20gateways:…&issn=14684527&volume=24&spage=40&aulast=Heery&aufirst=Rachel
  Genres defined as “referent-types”, such as book, chapter, journal, article, conf proc and paper, dissertation, patent, report; each has its own metadata spec
  High-level concept is the Bison-Futé model
  http://www.dlib.org/dlib/july01/vandesompel/07vandesompel.html

Preservation metadata
  OAIS (Open Archival Information System) underlies all digital preservation models
  Nothing to do with OAI
  Based on SIPs (Submission Info Packages), AIPs (Archival Info Packages) and DIPs (Dissemination Info Packages)
  The Producer wraps the stuff up in a SIP, it gets ingested into an AIP, and is sent out as a DIP

Some other metadata activities
  LOM - Learning Object Metadata
  IMS - Instructional Management Systems (builds on LOM)
  PRISM - Publishing Requirements for Industry Standard Metadata
  MEG - cross-sectoral Metadata for Education Group
  SCORM - Sharable Content Object Reference Model - US DoD project, also builds on IMS/LOM

How are we supposed to cope with all these metadata sets?
  A publisher’s metadata becomes an important asset for describing product to the outside world, esp. for trading and linking
  If publishers have their publications in electronic form, the metadata will already be in the file, so it just needs extracting and mapping to whatever metadata set the publisher chooses
  Production issue: who checks the metadata?

E-books
  OEBPS - Open eBook Publication Structure
  Three components:
  a) XML DTD for content
  b) DC-based metadata (but some noncompliant qualifier attributes)
  c) description of package’s structure, reading order, navigation
  Many OEB files are just (a)
  Version 2 being worked on, esp. M&I, and Rights

Formats
  Front runners are Adobe E-Book Reader (PDF based) and Microsoft Reader (.lit based)
  .lit limited to simple stuff, and not as robust as PDF, but don’t underestimate Microsoft
  New versions of Adobe will have built-in DOI capability

Text reflow
  Acrobat 5 introduced structured PDF
  The Holy Grail synthesis of structure and presentation
  Writes a PDF file in XML(ish)
  Asserts reading order
  Allows for reflow into different reader devices
  Works best for simple content only, but a good start

Conclusions
  There are lots of standards out there
  Some of them compete with one another
  Not all of them are formal
  They may change over time
  Publishing industry standards are not only developed by the publishing industry
  Not always easy to judge the winners