Emerging standards for libraries and publishers

Emerging Standards for Libraries
and Publishers
Cliff Morgan, John Wiley & Sons Ltd
UKSG briefing session, 15-17 April 2002
What I’ll be covering
 Identifiers
 Metadata
 E-books
What I won’t be covering
 Graphics (e.g. JPEG, GIF, PNG, SVG)
 Character sets (ASCII, Unicode)
 Relationship models (RDF, Topic Maps/XTM)
 E-commerce (UN/EDIFACT, XML-edi, ebXML)
 XML stuff (Schemas, XLink, XSL, XSLT, etc.)
 Usage stats standards (e.g. COUNTER, ANSI/NISO
Z39.7-1995)
 Rights metadata (XrML, ODRL)

Identifiers
 ISSN
 ISBN
 SICI
 BICI
 PII
 DOI
 ISTC
 Multimedia identifiers
ISBN
 International Standard Book Number
 ISO 2108
 e.g. 0-471-92755-4
 Geog location/language - publisher/imprint -
title (print format) - check character
 Has been a standard for > 30 years
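The check character can be verified mechanically. A minimal sketch, assuming the usual ISBN modulus-11 rule (weights 10 down to 1, with X standing for 10 in the last position); the function name is illustrative only:

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Return True if a 10-character ISBN passes the modulus-11 check."""
    chars = [c for c in isbn.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for weight, char in zip(range(10, 0, -1), chars):
        if char == "X" and weight == 1:
            value = 10  # X means 10 and is only allowed as the check character
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += weight * value
    # A valid ISBN has a weighted sum that is an exact multiple of 11
    return total % 11 == 0


print(isbn10_is_valid("0-471-92755-4"))
```

The hyphens only separate the parts for human readers; they play no role in the check.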
New ISBN
 ISBN is being revised - 13 digits from 1/1/05
 Can double capacity by giving a 979 prefix
 Issues:
- hexadecimal or decimal?
- limit ISBN to print - do something else
for electronic? versions? formats?
- assign to components (e.g. chaps)?
- should number be completely dumb?
- metadata deposit at assignment?
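A sketch of what the move to 13 digits could look like for an existing number. It assumes the revised ISBN follows the EAN-13 pattern (a 978 prefix on existing numbers and a modulus-10 check digit with alternating weights 1 and 3); the slide itself only states "13 digits" and the 979 prefix, so treat the details as an assumption:

```python
def isbn10_to_isbn13(isbn10: str, prefix: str = "978") -> str:
    """Re-express a 10-character ISBN as a 13-digit number (assumed EAN-13 rules)."""
    compact = "".join(c for c in isbn10 if c not in "- ")
    if len(compact) != 10:
        raise ValueError("expected a 10-character ISBN")
    digits = prefix + compact[:9]  # drop the old modulus-11 check character
    # EAN-13 check: weights alternate 1, 3, 1, 3, ... from the left
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    check = (10 - total % 10) % 10
    return digits + str(check)


print(isbn10_to_isbn13("0-471-92755-4"))
```

Under this scheme the new check digit is always a decimal digit, so the X of the old system disappears.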
ISSN
 International Standard Serial Number
 ISO 3297
 e.g. 0749-503X
 If publisher has not applied for an ISSN, any
3rd party can apply for their own data
management needs
 Different media get different ISSNs, e.g.
print ISSN is different from CD-ROM ISSN
 But different file formats don’t get different
ISSNs, so offline is different from online, but
PDF is same as HTML
 If online contains only abstracts of print
full text, no new ISSN for e-version
 If use print and eISSNs, must change both if
title changes

http://www.issn.org:8080/English/pub/getting-checking
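The final character of an ISSN is a check character. A minimal sketch, assuming the standard ISSN modulus-11 rule (weights 8 down to 2 on the first seven digits, X for a value of 10); function names are illustrative:

```python
def issn_check_char(first_seven: str) -> str:
    """Compute the ISSN check character from the first seven digits."""
    if len(first_seven) != 7 or not first_seven.isdecimal():
        raise ValueError("expected exactly seven digits")
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def issn_is_valid(issn: str) -> bool:
    """Validate an ISSN written with or without its hyphen."""
    compact = issn.replace("-", "").upper()
    if len(compact) != 8 or not compact[:7].isdecimal():
        return False
    return issn_check_char(compact[:7]) == compact[7]


print(issn_is_valid("0749-503X"))
```

Note that the check says nothing about medium: a print ISSN and an e-ISSN for the same title are simply two unrelated valid numbers.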
SICI
 Serial Item and Contribution Identifier
 ANSI/NISO Z39.56-1996 - reaffirmed

e.g. issue=0749-503X(20010115)18:1<>1.0.TX;2-X
Art. = 0749-503X(20010115)18:1<1:YGPIWG>2.0.TX;2-X
(Check digits in above examples have not been calculated.)
 Well used at issue level - bar codes
 Less used at article level
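To show how much is packed into a SICI, here is a rough sketch that splits the article-level example into its parts: ISSN, chronology, enumeration, contribution segment (location and title code) and control segment (CSI, DPI, MFI, version, check character). The regular expression is a simplification written for these examples, not a full Z39.56 parser:

```python
import re

# Simplified pattern: item segment, <contribution segment>, control segment
SICI_PATTERN = re.compile(
    r"^(?P<issn>\d{4}-\d{3}[\dX])"
    r"\((?P<chronology>[^)]*)\)"
    r"(?P<enumeration>[^<]*)"
    r"<(?P<contribution>[^>]*)>"
    r"(?P<csi>\d)\.(?P<dpi>\d)\.(?P<mfi>[A-Z]{2})"
    r";(?P<version>\d)-(?P<check>[0-9A-Z#])$"
)


def parse_sici(sici: str) -> dict:
    """Split a SICI string into named parts (no check character validation)."""
    match = SICI_PATTERN.match(sici)
    if match is None:
        raise ValueError(f"not a recognisable SICI: {sici!r}")
    fields = match.groupdict()
    # The contribution segment is empty for an issue, "location:titlecode" for an article
    location, _, title_code = fields.pop("contribution").partition(":")
    fields["location"] = location
    fields["title_code"] = title_code
    return fields


article = parse_sici("0749-503X(20010115)18:1<1:YGPIWG>2.0.TX;2-X")
print(article["issn"], article["enumeration"], article["title_code"])
```

The check characters in the slide examples were not calculated, so the sketch deliberately does not validate them.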
SICIs at Article Level
 Requires publication info - but publishers want to
assign article IDs before publication
 Long-winded
 Unfortunate syntax for Internet transfer (<>, #): needs SGML entity escaping and hex encoding
 Unclear what to do with special characters in
Title Code
 Not unique ID if two untitled articles on same
page (e.g. Letters)
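The transfer problem is easy to demonstrate. A small sketch using Python's standard library to show what the article-level example turns into once the angle brackets are escaped for markup and the reserved characters are hex-encoded for a URL:

```python
from html import escape
from urllib.parse import quote

sici = "0749-503X(20010115)18:1<1:YGPIWG>2.0.TX;2-X"

# Inside SGML/HTML/XML the angle brackets must become entities
markup_safe = escape(sici)

# Inside a URL every reserved character is percent (hex) encoded
url_safe = quote(sici, safe="")

print(markup_safe)
print(url_safe)
```

Neither form is something a person would want to read or retype, which is part of why SICIs are less used at article level.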

C = Contribution, not Component
 SICI allows identification of article, issue
ToC, issue Index and article abstract (DPIs
of 0, 1, 2, 3 respectively)
 No way of using SICI to identify any other
component (such as Figure, Table, Section)
 Not surprising since it’s a canonicalisation
nightmare

http://sunsite.berkeley.edu/SICI/version2.html
BICI
 Book Item and Component Identifier
 NISO DSFTU (Draft Standard for Trial Use)
 e.g. 0387119787(1982)<174:ADTATO>2.2.TX;1-Q

ISBN, date, location, title, component type, etc.
 Trial was Aug 2000 to Jan 2002 - not much
evidence of use
 Many issues the same as for SICI, but also
less business push
PII
 Publisher Item Identifier
 Proposed in 1995 by ACS, AIP, APS, IEEE and
Elsevier, but never became a standard
 e.g. S0749-503X011234
 Some publishers use as internal id since
doesn’t suffer from any of the SICI problems
 But no registration/maintenance agency
DOI
 Digital Object Identifier
 ANSI/NISO Z39.84-2000
 e.g. issue = 10.1002/yea.v18:1
article = 10.1002/yea.1234
 Well established in academic journals
publishing - esp. ‘cos of CrossRef
 4.2 million DOIs deposited to date

http://www.doi.org
Some publishing issues
regarding DOIs
What are they assigned to?
 Need for matching URL, so can’t assign to anything
you wouldn’t give a URL to
 Individual publishers need to decide their DOI
structure
 Doesn’t have to be human-friendly but must be
unique, easily generated, and matched with URL
 Application profiles for different genres

Processes
 Apply to Registration Agency (IDF, CDI,
CrossRef, Enpia, LON) for Registrant Prefix
 For individual DOIs, batch-process generate DOIs and URLs from electronic
metadata and send to RA for deposit
 DOIs never change (even if journal changes
ownership) but matched URLs (or other
locators) can
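The batch process can be sketched as follows. The DOI pattern (prefix/journal-code.article-number) follows the slide example 10.1002/yea.1234; the URL pattern, the host name and the record layout are hypothetical and stand in for whatever a publisher and its Registration Agency actually agree on:

```python
from dataclasses import dataclass

REGISTRANT_PREFIX = "10.1002"  # prefix taken from the slide examples


@dataclass(frozen=True)
class Article:
    journal_code: str    # e.g. "yea"
    article_number: int  # publisher's own running number


def make_doi(article: Article) -> str:
    """Build a DOI: dumb, unique and easily generated from metadata."""
    return f"{REGISTRANT_PREFIX}/{article.journal_code}.{article.article_number}"


def make_url(article: Article) -> str:
    """Hypothetical URL pattern; this is the part that may change later."""
    return f"http://journals.example.com/{article.journal_code}/{article.article_number}"


def build_deposit(articles: list) -> list:
    """Pair every DOI with its current URL, ready to send to the RA."""
    return [{"doi": make_doi(a), "url": make_url(a)} for a in articles]


for record in build_deposit([Article("yea", 1234), Article("yea", 1235)]):
    print(record["doi"], "->", record["url"])
```

If the journal changes ownership, only make_url needs to change and the records are redeposited; the DOIs stay exactly as they were.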
ISTC
International Standard Textual Work Code
 ISO Committee Draft 21047 - circulated Oct 01,
voting finished Jan 02: progressed to Enquiry
stage
 http://www.nlc-bnc.ca/iso/tc46sc9/21047.htm
 E.g. 0A9-2002-1223F332-0
(RA+year+WorkID+check)
 A Work (= abstract creation) id - replaces the
ISWC(L)

 Creator-centric - authors may apply to ISTC
Agency directly or via agents or via
publisher
 Requires metadata deposit too
 Publishers therefore need to capture these
numbers if they’ve been assigned to Works
 Will authors really bother with this?
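For publishers who do capture these numbers, the structure shown in the example (RA + year + WorkID + check) is simple to hold as separate fields. A minimal sketch based only on that example; the draft's check-character rule is not reproduced here, so the check element is stored but not verified:

```python
from typing import NamedTuple


class ISTC(NamedTuple):
    agency: str   # registration agency element
    year: str     # year element
    work_id: str  # work element
    check: str    # check character (not validated in this sketch)


def parse_istc(code: str) -> ISTC:
    """Split a hyphenated ISTC into its four elements."""
    parts = code.strip().split("-")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected four hyphen-separated elements: {code!r}")
    return ISTC(*parts)


print(parse_istc("0A9-2002-1223F332-0"))
```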
A couple of non-text, non-graphic
IDs you might want to know about
 ISAN
 ISWC
ISAN
International Standard Audiovisual Number
 ISO Draft International Standard 15706
 E.g. 153C-7365-B36F-844C-N
 Can be issued to movies, trailers, TV programmes,
episodes or series, ads, multimedia works if A/V
component is significant
 http://www.nlc-bnc.ca/iso/tc46sc9/isan.htm


Work has also started on a V-ISAN for Versions
ISWC
 International Standard Musical Work Code
(used to be ISWC(T))
 ISO 15707
 e.g. T-034524680-1
 Identifies any musical work, including
arrangements, movements, medleys,
samples

http://www.iswc.org/iswc/iswc/en/html/home.html
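The last digit of an ISWC is a check digit. A sketch assuming the commonly described rule (the nine work digits weighted by their positions 1 to 9, plus 1 for the T prefix, modulus 10); the example number from the slide passes it:

```python
def iswc_check_digit(work_digits: str) -> int:
    """Compute the ISWC check digit from the nine-digit work identifier."""
    if len(work_digits) != 9 or not work_digits.isdecimal():
        raise ValueError("expected exactly nine digits")
    # The leading 1 accounts for the T prefix
    total = 1 + sum(pos * int(d) for pos, d in enumerate(work_digits, start=1))
    return (10 - total % 10) % 10


def iswc_is_valid(iswc: str) -> bool:
    """Validate an ISWC written with hyphens and/or dots."""
    compact = iswc.replace("-", "").replace(".", "").upper()
    if len(compact) != 11 or compact[0] != "T" or not compact[1:].isdecimal():
        return False
    return iswc_check_digit(compact[1:10]) == int(compact[10])


print(iswc_is_valid("T-034524680-1"))
```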
Metadata
 Resource discovery (Dublin Core, OAI-PMH),
incl. Linking (CrossRef)
 Product metadata (ONIX and ONIX for
Serials)
 Preservation metadata (OAIS)

I am not going to talk about library-specific sets such as
MARC, Z39.50, AACR2, etc.
Dublin Core
 Defined Universal Bibliographic Language
for Internet Navigation and Coherent Online
Resource Exploration [not really!]
 ANSI/NISO Z39.85
 DC 1.1 (simple, unqualified set of 15
elements)
 Qualified set (DCQ? dcterms?) needed to do
anything more than basic - not standard yet
 DC has been mandated by UK Government
(“e-GMS”)
 Application Profiles will deal with defined
local extensions via namespace
declarations
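To give a feel for how little simple DC asks for, here is a sketch that builds an unqualified record for this very presentation. The 15 element names and the DC 1.1 namespace are the standard ones; the wrapping record element and the choice of values are illustrative only:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 elements of simple (unqualified) Dublin Core
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_dc_record(fields: dict) -> ET.Element:
    """Build a flat record; every DC element is optional and repeatable."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"{name!r} is not a simple Dublin Core element")
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


record = build_dc_record({
    "title": "Emerging Standards for Libraries and Publishers",
    "creator": "Cliff Morgan",
    "publisher": "John Wiley & Sons Ltd",
    "date": "2002-04",
})
print(ET.tostring(record, encoding="unicode"))
```

Anything finer grained than this (for example, saying what kind of date) is where the qualified set and Application Profiles come in.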
OAI-PMH
 Open Archives Initiative Protocol for Metadata Harvesting
 Not really an archive in the sense of repository, more of a
political statement and a metadata harvesting protocol
 Came out of the E-print community, but they welcome
commercial publishers
 Supported by DLF and CNI
 Uses simple (unqualified) Dublin Core as its metadata
 E.g. <creator>Cliff Morgan</creator>
 Version 2 of protocol due for release June 2002
 http://www.openarchives.org
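Harvesting is just an HTTP request with a verb and a few arguments. A minimal sketch of how a harvester might build a request for Dublin Core records; the repository address is hypothetical, while the ListRecords verb and the oai_dc metadata prefix are the protocol's own names:

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # harvest only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address, for illustration only
print(list_records_url("http://archive.example.org/oai", from_date="2002-04-01"))
```

The response is an XML document containing one simple Dublin Core record per item, which the harvester then indexes however it likes.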
CrossRef metadata set
 CrossRef matches the metadata in a
citation with the metadata in its Metadata
Database (MDDB), which includes the DOI
for the resource
 Participating publishers (91 of ‘em) deposit
the m/data with DOI into the MDDB
 To date, 3.7M DOIs, covering 5000+ jnls

http://www.crossref.org
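The matching idea can be sketched as a keyed lookup. This is an illustration of the principle only, not CrossRef's actual matching logic: the choice of key (ISSN, volume, first page) is an assumption, and the single database entry is made up by combining the example identifiers used earlier in these slides:

```python
def citation_key(issn: str, volume, first_page) -> tuple:
    """Normalise the citation fields used for matching (assumed key)."""
    return (issn.replace("-", "").upper(), str(volume).strip(), str(first_page).strip())


# Stand-in for the Metadata Database: deposited metadata -> DOI (illustrative entry)
mddb = {
    citation_key("0749-503X", 18, 1): "10.1002/yea.1234",
}


def lookup_doi(database: dict, issn: str, volume, first_page):
    """Return the DOI for a citation, or None if nothing was deposited."""
    return database.get(citation_key(issn, volume, first_page))


print(lookup_doi(mddb, "0749503x", "18", "1"))
print(lookup_doi(mddb, "0749-503X", 19, 1))
```

The point is that the quality of the match depends entirely on the quality of the metadata the publishers deposit.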
New version
 Version 2 much more complicated - full
schema is 113 pages long
 In addition to journals, covers books and
conference proceedings, at whole title and
chapter level
 Some element names are different from
CrossRef 1.0
ONIX
OnLine Information eXchange
 Latest release is 2.0
 Original focus was message format for books
through the trade, but is fast becoming a
universal metadata set for describing
publications
 http://www.editeur.org

 ONIX being championed by a number of
publishers and online retailers
 Swedish Royal Library using ONIX as an
input medium
ONIX for Serials
 Provides rich cataloguing information for
agents, librarians, users
 Supports alerting, despatch and library
check-in
 Structured, multi-level bibliographic
descriptions, including ToCs
 Descriptions for library holdings (direct to
OPACs)
Draft 2 just released this month
 Subscription Package Record provides product
catalogue info about subscription packages
 Serial Title Record provides catalogue info about
an individual serial
 Serial Item Record provides structured multilevel bibliographic description of serial parts

So is the CrossRef set like the ONIX
for Serials set?
 No
 They both include metadata that can be
used to describe journals, issues and
articles
 But they don’t use the same element names
 CrossRef has mapped to ONIX but not to
ONIX for Serials yet - but has said will
support when released
OpenURL
NISO Work Item
 Separates metadata for resource from metadata
for location
 Resolver services (such as SFX, CrossRef) make
the context-sensitive link
 Solves the “appropriate copy” problem, where
more than one legit copy of an article may be
available to a library, e.g. local holding,
consortium, aggregator service, mirror site,
publisher

OpenURL metadata
 OpenURL comprises BASEURL and QUERY
 BASEURL identifies the resolver; QUERY is a
resource description
 e.g. (simplified):
http://resolver.ukoln.ac.uk/?genre=article
&atitle=Information%20gateways:…
&issn=14684527&volume=24&spage=40
&aulast=Heery&aufirst=Rachel
 Genres defined as “referent-types”, such as
book, chapter, journal, article, conf proc
and paper, dissertation, patent, report; each has its own metadata spec
 High-level concept is the Bison-Futé model
http://www.dlib.org/dlib/july01/vandesompel/07vandesompel.html
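The BASEURL/QUERY split is what makes the link context-sensitive: the same resource description can be sent to whichever resolver a library runs. A minimal sketch using the fields from the example above (the truncated article title is left out); the second resolver address is hypothetical:

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def build_openurl(base_url: str, description: dict) -> str:
    """Join a resolver BASEURL and a resource description QUERY."""
    return f"{base_url}?{urlencode(description, quote_via=quote)}"


description = {
    "genre": "article",
    "issn": "14684527",
    "volume": "24",
    "spage": "40",
    "aulast": "Heery",
    "aufirst": "Rachel",
}

# Same QUERY, two different resolvers (the second address is made up)
ukoln_link = build_openurl("http://resolver.ukoln.ac.uk/", description)
local_link = build_openurl("http://resolver.library.example.ac.uk/", description)

print(ukoln_link)
print(parse_qs(urlsplit(local_link).query)["aulast"])
```

Each resolver decides for itself which copy is appropriate for its users; the description of the article never changes.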
Preservation metadata
 OAIS (Open Archival Information System)
underlies all digital preservation models
 Nothing to do with OAI
 Based on SIPs (Submission Info Packages),
AIPs (Archival Info Packages) and DIPs
(Dissemination Info Packages)
 The Producer wraps the stuff up in a SIP, it gets
ingested into an AIP, and sent out as a DIP
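The package flow can be pictured as three record types and two steps. A toy sketch only: the field names, the archive identifier and the use of a SHA-256 checksum as preservation metadata are assumptions for illustration, not anything the OAIS model prescribes in this form:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SIP:
    """Submission Information Package: what the Producer hands over."""
    content: bytes
    descriptive_metadata: dict


@dataclass(frozen=True)
class AIP:
    """Archival Information Package: what the archive actually keeps."""
    content: bytes
    descriptive_metadata: dict
    preservation_metadata: dict


@dataclass(frozen=True)
class DIP:
    """Dissemination Information Package: what the Consumer receives."""
    content: bytes
    descriptive_metadata: dict


def ingest(sip: SIP, archive_id: str) -> AIP:
    """Ingest a SIP, adding the archive's own preservation metadata."""
    preservation = {
        "archive_id": archive_id,
        "sha256": hashlib.sha256(sip.content).hexdigest(),  # assumed fixity check
    }
    return AIP(sip.content, dict(sip.descriptive_metadata), preservation)


def disseminate(aip: AIP) -> DIP:
    """Send the content out again without the archive-internal metadata."""
    return DIP(aip.content, dict(aip.descriptive_metadata))


sip = SIP(b"<article>...</article>", {"title": "An example article"})
aip = ingest(sip, archive_id="example-0001")
dip = disseminate(aip)
print(aip.preservation_metadata["archive_id"], dip.descriptive_metadata["title"])
```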
Some other metadata activities
 LOM - Learning Object Metadata
 IMS - Instructional Management Systems (builds on
LOM)
 PRISM - Publishing Requirements for Industry
Standard Metadata
 MEG - cross-sectoral Metadata for Education
Group
 SCORM - Sharable Content Object Reference
Model - US DoD project, also builds on IMS/LOM

How are we supposed to cope with
all these metadata sets?
 A publisher’s metadata becomes an important
asset for describing product to the outside world,
esp. for trading and linking
 If publishers have their publications in electronic
form, the metadata will be in there in the file so it
just needs extracting and mapping to whatever
metadata set the publisher chooses
 Production issue: who checks the metadata?
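Extraction and mapping can be sketched in a few lines. Everything specific here is hypothetical: the in-house element names, the sample article and the mapping table are invented for illustration, with simple Dublin Core as the target set:

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from in-house element names to Dublin Core elements
ELEMENT_MAP = {
    "arttitle": "title",
    "author": "creator",
    "pubdate": "date",
    "doi": "identifier",
}

# Invented sample file header, for illustration only
SAMPLE = """<article>
  <arttitle>An example article</arttitle>
  <author>A. N. Author</author>
  <author>B. Author</author>
  <pubdate>2002-04</pubdate>
  <doi>10.1002/yea.1234</doi>
  <pages>1-10</pages>
</article>"""


def extract_and_map(xml_text: str, element_map: dict):
    """Return (mapped metadata, names of elements with no mapping)."""
    root = ET.fromstring(xml_text)
    mapped, unmapped = {}, []
    for element in root.iter():
        text = (element.text or "").strip()
        if element is root or not text:
            continue
        target = element_map.get(element.tag)
        if target is None:
            unmapped.append(element.tag)  # something for a person to review
        else:
            mapped.setdefault(target, []).append(text)
    return mapped, unmapped


mapped, unmapped = extract_and_map(SAMPLE, ELEMENT_MAP)
print(mapped)
print("needs checking:", unmapped)
```

Supporting another metadata set means adding another mapping table, not re-keying the data; the list of unmapped elements is one place where the checking question becomes concrete.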

E-books
OEBPS - Open E-Book Publication Structure
 Three components:
a) XML DTD for content
b) DC-based metadata (but some noncompliant qualifier attributes)
c) description of package’s structure,
reading order, navigation
 Many OEB files are just (a)
 Version 2 being worked on, esp. M&I, and Rights

Formats
 Front runners are Adobe E-Book Reader
(PDF based) and Microsoft Reader (.lit
based)
 .lit limited to simple stuff, and not as robust
as PDF, but shouldn’t underestimate M/soft
 New versions of Adobe will have built-in DOI
capability
Text reflow
 Acrobat 5 introduced structured PDF
 The Holy Grail synthesis of structure and
presentation
 Writes a PDF file in XML(ish)
 Asserts reading order
 Allows for reflow into different reader
devices
 Works best for simple content only, but a good start
Conclusions
 There are lots of standards out there
 Some of them compete with one another
 Not all of them are formal
 They may change over time
 Publishing industry standards are not only
developed by the publishing industry
 Not always easy to judge the winners