A Note on the Formats used for Digitization Information

Thomas Fischer (fischer@mail.sub.uni-goettingen.de)
I will consider different formats used to disseminate information on digitized or digital material in the context of the World Digital Mathematics Library (WDML), and the Emani initiative in particular. The focus is on their use as formats for the exchange of data via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
I will consider four formats:
- The general Dublin Core Simple format
- The versions of the RFC1807 format used by Cornell University
- The minidml format provided by the NUMDAM project
- The suggestions for metadata exchange elaborated by Math Reviews and Zentralblatt.
As far as I know, these are the most advanced suggestions for the formats in question.
The OAI-PMH
The Open Archives Initiative [1] has developed its “Protocol for Metadata Harvesting” [2] (PMH) with the explicit goal of exposing datasets for harvesting and aggregation. In this way, “Service Providers” can collect the information and provide enriched services on this basis. This is how the “Open” in the name is to be interpreted: the archives themselves are not necessarily open to everybody, but the information on the items is freely available.
The PMH offers different options for the presentation of the data. While a basic Dublin Core
format is required, other formats can be exhibited and requested through the protocol. This is
used by the providers of information on digitized mathematical literature, and is the main topic of these considerations. The OAI-PMH is considered an ideal means of exchanging information between the different digitization projects, and in fact is already in use by several of
them. The main question is: is there an optimal format that can or should be used by the mathematical digitization community, or if there is none, what should it look like?
The OAI has built a tool to investigate the offerings of the different OAI data providers [3]. It is a combination of server-side scripts and client-side JavaScript, and a little awkward to use, but still extremely helpful for digging into the available archives. Since the PMH is built on top of the HTTP protocol, all requests can be issued through a standard browser, so I give those requests as URLs. The returned result will usually be some form of XML, but it should be recognizable. One caveat is that not all browsers handle XML well, and some are stricter than others in checking the XML; for example, Internet Explorer will not display datasets that contain mangled umlauts.
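Because every PMH request is an ordinary HTTP GET, a request URL can also be assembled programmatically. The following is a minimal sketch in Python, not part of any of the projects discussed here; the helper name oai_request_url is made up for illustration, and the base URL and identifier are those of the NUMDAM record quoted later in this note.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL (hypothetical helper, for illustration only).

    urlencode percent-escapes reserved characters such as ":" in identifiers.
    """
    return base_url + "?" + urlencode({"verb": verb, **arguments})

url = oai_request_url(
    "http://www.numdam.org/oai",
    "GetRecord",
    metadataPrefix="oai_dc",
    identifier="oai:numdam.org:AIF_1973__23_1_157_0",
)

# Decoding the query string recovers the original arguments.
arguments = parse_qs(urlsplit(url).query)
print(url)
print(arguments["identifier"][0])
```

The escaped form is equivalent to the unescaped URLs given in the text; a browser performs the same escaping silently.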
The RFC 1807 format
The RFC 1807 [4] is a “format for bibliographic records describing technical reports”, published in 1995 before the inception of XML, let alone the XML Schema language. An XML version was provided by the OAI in their Implementation Guidelines [5]. This format is used at Cornell University for its digitization projects and for Project Euclid, albeit in slightly different versions.

[1] http://www.openarchives.org/
[2] http://www.openarchives.org/OAI/openarchivesprotocol.html
[3] http://re.cs.uct.ac.za/
[4] http://www.faqs.org/rfcs/rfc1807.html
RFC 1807 provides the following fields (<M> = mandatory, <O> = optional):
<M>  BIB-VERSION of this bibliographic records format
<M>  ID
<M>  ENTRY date
<O>  ORGANIZATION
<O>  TITLE
<O>  TYPE
<O>  REVISION
<O>  WITHDRAW
<O>  AUTHOR
<O>  CORP-AUTHOR
<O>  CONTACT for the author(s)
<O>  DATE of publication
<O>  PAGES count
<O>  COPYRIGHT, permissions and disclaimers
<O>  HANDLE
<O>  OTHER_ACCESS
<O>  RETRIEVAL
<O>  KEYWORD
<O>  CR-CATEGORY
<O>  PERIOD
<O>  SERIES
<O>  MONITORING organization(s)
<O>  FUNDING organization(s)
<O>  CONTRACT number(s)
<O>  GRANT number(s)
<O>  LANGUAGE name
<O>  NOTES
<O>  ABSTRACT
<M>  END
These have been carried over to the XML format, omitting only the “END” tag, which is unnecessary since XML provides its own wrapper.
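A harvester can check the three remaining mandatory fields mechanically. The sketch below assumes a record with the namespace declarations and OAI headers stripped, as in the excerpts shown in this note; the function name and the sample record (which deliberately lacks the entry date) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Mandatory RFC 1807 fields that remain in the XML version (END is dropped).
MANDATORY = ("bib-version", "id", "entry")

def missing_mandatory(record_xml):
    """Return the mandatory field names that are absent or empty in a record."""
    root = ET.fromstring(record_xml)
    # Collect the text of each direct child, keyed by element name.
    present = {child.tag: (child.text or "").strip() for child in root}
    return [name for name in MANDATORY if not present.get(name)]

# Illustrative record without an <entry> element (not taken from a real archive).
sample = """<rfc1807>
  <bib-version>CS-TR-v2.1</bib-version>
  <id>euclid.annm:1105737690</id>
  <type>text</type>
</rfc1807>"""

print(missing_mandatory(sample))
```

Records delivered by a real OAI data provider carry XML namespaces, so the element names would have to be qualified accordingly.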
Formats used at Cornell University Library and Project Euclid
The rfc1807 and DC formats are used by the Cornell University Library for their OAI service. The “Cornell University Library: Historical Mathematics Monographs” [6] (HMM) and “Project Euclid” [7] both give access to the data of their collections through OAI data providers. Here are some examples of datasets in this format, extracting the salient information and omitting the headers.

[5] See http://www.openarchives.org/OAI/2.0/guidelines-rfc1807.htm. The respective XML schema document is available at http://www.openarchives.org/OAI/1.1/rfc1807.xsd.
[6] http://historical.library.cornell.edu/math/
[7] http://projecteuclid.org/

From Euclid:
http://projecteuclid.org/Dienst?verb=GetRecord&metadataPrefix=oai_rfc1807&identifier=OAI:CULeuclid:euclid.annm/1105737690
<rfc1807>
<bib-version>CS-TR-v2.1</bib-version>
<id>euclid.annm:1105737690</id>
<entry>January 18, 2005</entry>
<title>The space of embedded minimal surfaces of fixed genus in a 3-manifold II; Multivalued graphs for disks</title>
<type>text</type>
<author>Colding, Tobias H.</author>
<author>Minicozzi, William P.</author>
<date>July 2004</date>
<pages>24</pages>
</rfc1807>
http://projecteuclid.org/Dienst?verb=GetRecord&metadataPrefix=oai_rfc1807&identifier=OAI:CULeuclid:euclid.hha:1088453320
<rfc1807>
<bib-version>CS-TR-v2.1</bib-version>
<id>euclid.hha:1088453320</id>
<entry>December 17, 2004</entry>
<title>COMPUTING LINKING NUMBERS OF A FILTRATION</title>
<type>text</type>
<author>EDELSBRUNNER, HERBERT</author>
<author>ZOMORODIAN, AFRA</author>
<date>January 2003</date>
<pages>19</pages>
<abstract><p> We develop fast algorithms for computing the linking number of a simplicial complex within a filtration.We give experimental results in applying our work toward
the detection of non-trivial tangling in biomolecules, modeled as alpha complexes.</p></abstract>
</rfc1807>
From Historical Mathematics Monographs:
http://mathbooks.library.cornell.edu:8085/Dienst?verb=GetRecord&metadataPrefix=oai_rfc1807&identifier=OAI:CULmath:cul.math/00060001
<rfc1807>
<bib-version>CS-TR-v2.1</bib-version>
<id>cul.math:00060001</id>
<entry>November 11, 2002</entry>
<organization>B. G. Teubner</organization>
<title>Abriss einer theorie der algebraischen funktionen einer verèanderlichen in neuer fassung</title>
<type>text</type>
<author>Stahl, Hermann, 1843-1908.</author>
<date>1911</date>
<pages>116</pages>
<language>ger</language>
<notes>Computer file. Ithaca, NY : Cornell University Library, 1990. 116 image files.Files
for the images of individual pages are encoded in ALDUS/Microsoft TIFF Version 5.0 using
facsimile-compatible CITT Group 4 compression.</notes>
</rfc1807>
The same document using the Dublin Core Simple Format:
http://mathbooks.library.cornell.edu:8085/Dienst?verb=GetRecord&metadataPrefix=oai_dc&identifier=OAI:CULmath:cul.math/00060001
<oai_dc:dc>
<dc:title>Abriss einer theorie der algebraischen funktionen einer verèanderlichen in neuer
fassung</dc:title>
<dc:creator>Stahl, Hermann, 1843-1908.</dc:creator>
<dc:subject>Algebraic functions.</dc:subject>
<dc:subject>Functions of complex variables.</dc:subject>
<dc:publisher>Cornell University Library</dc:publisher>
<dc:date>1911</dc:date>
<dc:type>text</dc:type>
<dc:identifier>http://mathbooks.library.cornell.edu:8085/GetRecord?id=cul.math/00060001
</dc:identifier>
<dc:identifier>cul.math/00060001</dc:identifier>
<dc:language>ger</dc:language>
<dc:rights>Copyright 2002 Cornell University Library</dc:rights>
</oai_dc:dc>
A brief inspection shows the following problems:
- The journal from which the article is taken is not directly available for the Euclid sets (annm stands for “Annals of Mathematics”); the year, volume, issue and page numbers are missing.
- The second entry uses capitals for title and author.
- There are several misprints in the abstract, plus erroneous HTML tags.
- There are no URLs to access the article; this can only be deduced from other information (the URL would look like http://projecteuclid.org/Dienst/UI/1.0/Display/euclid.annm/1105737690). A URL is only available with the DC format from HMM.
- The given date in Project Euclid is not the date of publication.
- For the books in the HMM collection, the publisher is given only in the rfc1807 format, and misleadingly given as “Cornell University Library” in the DC format.
- The HMM data do not use the UTF-8 encoding promised in their header.
- Capitalisation in the German titles seems arbitrary.
- The author field contains additional information on the lifespan of the author; this is not intended in DC and not admissible in rfc1807.
Overall this information is insufficient for the exchange of digitization information, although the rfc1807 format in particular could be enhanced to provide richer and more rigid information:
- The field “OTHER_ACCESS” could be used to give the URL (homepage) of the given document.
- DATE should be the publication date. The field “ENTRY” gives the date of “creating this bibliographic record”, which could be the same as the date of production of the digital document: in some sense this “bibliographic record” is the final stage of the digitization process.
- The “SERIES” field is meant to contain information on the journal, volume and issue.
- Given the heritage of the rfc1807 format from technical reports, there is only a field “CR-CATEGORY” for the classification according to the “Computer Reviews”, but an “MSC-CATEGORY” would seem plausible, although extensions to the standard are not easily accomplished [8].
The minidml format used with the NUMDAM project
The project “Numérisation de documents anciens mathématiques” [9] (NUMDAM) is based at the “Cellule de Coordination Documentaire Nationale pour les Mathématiques” [10] (Cellule MathDoc) and provides access to digitisations of the most pre-eminent French mathematics journals. The Cellule MathDoc has also started a “mini-DML Project” [11] that already collects data from different digitization centres and makes them searchable; they have also developed a “minidml” metadata format. The NUMDAM project offers the standard DC and the minidml format through their OAI data provider [12]. Here is an example of an article from the Annales de l'institut Fourier (AIF) in DC and minidml format:
DC:
http://www.numdam.org/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:numdam.org:AIF_1973__23_1_157_0
<oai_dc:dc>
<dc:creator>Gérard, R.</dc:creator>
<dc:creator>Levelt, A. M.</dc:creator>
<dc:title>Invariants mesurant l'irrégularité en un point singulier des
systèmes d'équations différentielles linéaires</dc:title>
<dc:date>1973</dc:date>
<dc:identifier>AIF_1973__23_1_157_0</dc:identifier>
<dc:identifier>http://www.numdam.org/item?id=AIF_1973__23_1_157_0</dc:identifier>
<dc:identifier>oai:numdam.org:AIF_1973__23_1_157_0</dc:identifier>
<dc:identifier>citation: Ann. Inst. Fourier 23, no.1, 157-195 (1973)</dc:identifier>
<dc:identifier>MR 49 #10947</dc:identifier>
<dc:identifier>Zbl 0243.35016</dc:identifier>
</oai_dc:dc>
[8] While the XML Schema document is provided by the OAI, the underlying format referred to is “CS-TR-v2.1”, which is part of RFC 1807 and refers to CR-CATEGORY and not to MSC.
[9] The website is http://www.numdam.org/; an English version is available at http://www.numdam.org/en/.
[10] http://www-mathdoc.ujf-grenoble.fr/
[11] http://www.numdam.org/minidml/; for more information on the project see “Introducing the mini-DML project” by Thierry Bouche (http://www.numdam.org/minidml/litterature/article-minidml.pdf).
[12] http://www.numdam.org/oai

minidml:
http://www.numdam.org/oai?verb=GetRecord&metadataPrefix=minidml&identifier=oai:numdam.org:AIF_1973__23_1_157_0
<minidml>
<author>Gérard, R.</author>
<author>Levelt, A. M.</author>
<title>Invariants mesurant l'irrégularité en un point singulier des
systèmes d'équations différentielles linéaires</title>
<language>fr</language>
<identifier scheme="internal">AIF_1973__23_1_157_0</identifier>
<identifier
scheme="url">http://www.numdam.org/item?id=AIF_1973__23_1_157_0</identifier>
<identifier scheme="oai">oai:numdam.org:AIF_1973__23_1_157_0</identifier>
<citation>Ann. Inst. Fourier 23, no.1, 157-195 (1973)</citation>
<jtitle>Annales de l'institut Fourier</jtitle>
<home>http://annalif.ujf-grenoble.fr</home>
<provider>
<name>Project Numdam</name>
<home>http://www.numdam.org</home>
</provider>
<abbrev>Ann. Inst. Fourier</abbrev>
<volume>23</volume>
<issue>1</issue>
<date>1973</date>
<pages>157-195</pages>
<format>application/pdf</format>
<format>application/x-djvu</format>
<reviewid service="zbl">0243.35016</reviewid>
<reviewid service="mr">49 #10947</reviewid>
</minidml>
There are some minor problems with these formats. The OAI Repository Explorer complains
about the DC format:
[Error] re.C3IbZK:15:94: cvc-complex-type.2.4.a: Invalid content was found
starting with element 'oai_dc:dc'. One of
'{"http://www.openarchives.org/OAI/2.0/oai_dc/":dc}' is expected.
(There seems to be a slash “/” missing after
http://www.openarchives.org/OAI/2.0/oai_dc.)
And there are mistakes in the minidml format as well:
[Error] minidml.xsd:34:32: src-resolve: Cannot resolve the name
'minidml:ProviderType' to a(n) 'type definition' component.
(the ProviderType is defined as providerType in the XML schema document)
[Error] re.HJV7MN:24:17: cvc-complex-type.2.4.a: Invalid content was found
starting with element 'home'.
(the name of the element is jhome, not home).
And actually, the encoding of the data is not UTF-8, but Latin-1 in spite of the XML header
information.
Nevertheless, the DC format contains full citation information in the <dc:identifier>citation:…</dc:identifier> construct. This is some sort of a Dublin Core Structured Value [13] and as such can be used if it is understood by the recipient. The unqualified dc:identifier presents the URL of the given document.

[13] See http://dublincore.org/documents/dcmi-dcsv/index.shtml for the full description; in particular, the format suggested there is slightly different and might call for “citation=”, but that does not make a difference for the given considerations.
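A recipient that wants to “understand” the citation identifier has to parse it. The following sketch does so with a regular expression; the pattern is an assumption derived solely from the single NUMDAM example above (“journal volume, no.issue, first-last (year)”) and would need to be extended for other citation layouts.

```python
import re

# Pattern inferred from one example only; not a documented NUMDAM grammar.
CITATION_RE = re.compile(
    r"^citation:\s*(?P<journal>.+?)\s+(?P<volume>\d+),\s*"
    r"no\.\s*(?P<issue>[\d\-/]+),\s*"
    r"(?P<first>\d+)-(?P<last>\d+)\s*\((?P<year>\d{4})\)$"
)

def parse_citation(value):
    """Split a 'citation: ...' identifier into its parts, or return None."""
    match = CITATION_RE.match(value.strip())
    return match.groupdict() if match else None

parts = parse_citation("citation: Ann. Inst. Fourier 23, no.1, 157-195 (1973)")
print(parts)

# Identifiers that are not citations (URLs, review numbers) are simply skipped.
print(parse_citation("MR 49 #10947"))
```

The need for such guesswork is exactly what separate fields, as in the minidml format, avoid.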
It should also be noted that NUMDAM provides – if available – the numbers of the review
articles in Math Reviews and Zentralblatt. This is extremely important, since these databases,
together with some others like the “Jahrbuch über Fortschritte der Mathematik”, provide the
anchors to the article in the web of mathematics.
While this is in some sense sufficient, the richer information is available through the minidml
format, providing the full name and the home page of the journals and the bibliographic information in separate fields. The complex type “provider” is a container for the identification
of the digitization project, giving the name and the website.
Additional fields not present in the example are: keyword, abstract, msc, issn, publisher and rights, with essentially the obvious meanings. It should be noted that some of the fields, such as title, keyword or abstract, can carry an additional language qualifier; the “scheme” attribute used with identifier and the “service” attribute used with reviewid already appear in the example.
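The qualified fields make the record easy to consume. As a minimal sketch, assuming a minidml record with namespaces and OAI headers stripped as in the excerpt above, the identifiers can be indexed by their “scheme” and the review numbers by their “service”; the helper name index_by_attribute is made up for illustration.

```python
import xml.etree.ElementTree as ET

def index_by_attribute(root, tag, attribute):
    """Map attribute value -> element text for all direct children named `tag`."""
    return {
        child.get(attribute): (child.text or "").strip()
        for child in root
        if child.tag == tag
    }

# Shortened copy of the NUMDAM example record shown above.
record = ET.fromstring("""<minidml>
  <identifier scheme="internal">AIF_1973__23_1_157_0</identifier>
  <identifier scheme="url">http://www.numdam.org/item?id=AIF_1973__23_1_157_0</identifier>
  <identifier scheme="oai">oai:numdam.org:AIF_1973__23_1_157_0</identifier>
  <reviewid service="zbl">0243.35016</reviewid>
  <reviewid service="mr">49 #10947</reviewid>
</minidml>""")

identifiers = index_by_attribute(record, "identifier", "scheme")
reviews = index_by_attribute(record, "reviewid", "service")
print(identifiers["url"])
print(reviews["mr"], reviews["zbl"])
```

With the unqualified DC record, the same distinction has to be guessed from the shape of each dc:identifier value.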
The Proposals by Mathematical Reviews and Zentralblatt
In January 2004, Jane Kister from Mathematical Reviews and Bernd Wegner from Zentralblatt MATH proposed “Standards for metadata for digitized mathematics” [14]. The proposal essentially called for the use of BibTeX as the format for providing information on the digitized items. An XML format was also presented which could serve as an exchange format for the OAI-PMH. Here are the proposed fields:
AUTHOR: required
One or more author names, last name first, connected by ‘and’, such as “Brelot, Marcel
and Choquet, Gustave”, as they appear on the title page.
TITLE: required
The title of the item as it appears on the title page.
JOURNAL: optional
An abbreviation for the journal name; it is recommended that the MR or ZBL abbreviation be used.
FJOURNAL: required
Full name of journal as it appears on the title page of the journal.
VOLUME: optional
The volume number; this may include multiple volume numbers, such as “34/35”
YEAR: required
The year of publication; this may include multiple years, such as “1938-39”.
NUMBER: optional
The issue number; this may include multiple issues such as “2-3”
PAGES: required
The paging string; this may include ranges such as “1–23” or more complicated strings
ISSN: optional
The International Standard Serial Number; this may include hyphens
URL: required
The web location of this item.
NOTE: optional
For notes or miscellaneous information
MRID: optional
The identifier used in MathSciNet
ZBLID: optional
The identifier used in Zentralblatt MATH
JFMID: optional
The identifier used in Jahrbuch über die Fortschritte der Mathematik (JFM)

[14] The proposal is available at http://www.wdml.org/standards/metadata.pdf.
All of this information can be represented in the minidml format and, like minidml, it applies only to journals. Which fields should be optional and which required is open to dispute: for example, the abbreviated name of a journal is usually more likely to be used than the full name, and the review identifiers, if available at all, should definitely be part of the dataset. In general, though, there is no serious contradiction between these suggestions and NUMDAM’s minidml, only differences in the XML used to wrap the data. On the other hand, minidml provides a more rigid and precise format.
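To illustrate how closely the two proposals correspond, the following sketch renders the NUMDAM example record as a BibTeX entry using the field names and the required/optional split listed above. The function name, the “@article” entry type and the citation key are assumptions made for this illustration; no escaping of special characters is attempted.

```python
# Field names and required/optional status as listed in the proposal above.
REQUIRED = ("AUTHOR", "TITLE", "FJOURNAL", "YEAR", "PAGES", "URL")
OPTIONAL = ("JOURNAL", "VOLUME", "NUMBER", "ISSN", "NOTE", "MRID", "ZBLID", "JFMID")

def to_bibtex(key, fields):
    """Render a dict of proposal fields as a BibTeX entry (hypothetical helper)."""
    missing = [name for name in REQUIRED if not fields.get(name)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    lines = ["@article{" + key + ","]
    for name in REQUIRED + OPTIONAL:
        if fields.get(name):
            lines.append(f"  {name} = {{{fields[name]}}},")
    lines.append("}")
    return "\n".join(lines)

# Values taken from the minidml example record.
entry = to_bibtex("AIF_1973__23_1_157_0", {
    "AUTHOR": "Gérard, R. and Levelt, A. M.",
    "TITLE": "Invariants mesurant l'irrégularité en un point singulier des "
             "systèmes d'équations différentielles linéaires",
    "FJOURNAL": "Annales de l'institut Fourier",
    "JOURNAL": "Ann. Inst. Fourier",
    "VOLUME": "23",
    "NUMBER": "1",
    "YEAR": "1973",
    "PAGES": "157-195",
    "URL": "http://www.numdam.org/item?id=AIF_1973__23_1_157_0",
    "MRID": "49 #10947",
    "ZBLID": "0243.35016",
})
print(entry)
```

Every value needed here is available as a separate minidml field, whereas the DC record would first have to be parsed.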
Suggestions
I think that the minidml format should be taken as the starting point for the format used to exchange metadata between digitization projects. Since this is directed towards articles, for
books and other formats, some extensions will be necessary. I propose to consider the following information.
Editor
Distinguishing the role of the editor of a collection from that of an author seems useful once books are considered. Note that this is quite a different role from that of the editor of a journal: a book usually has either author(s) or editor(s), but not both. So one could use a field “creator” for both author and editor, probably with the standard extension (ed.) for the editors. Alternatively, an additional field “editor” could be created.
ISBN
For books, the book number instead of the serial number is needed. This could be a new field
“isbn” or instead a field like “standardNumber” with schemes issn and isbn. Note that this is
not really an “identifier”, since one object may have several ISBNs.
Place of publication
This is an essential part of the bibliographic information for some citation rules, so it should
be present. This could be part of an enhanced “publisher” field as in:
<publisher>Heidelberg, New York, Tokyo: Springer</publisher>
(or vice versa, or with a different separator) or a complex type as in
<publisher>
<pname>Springer</pname>
<pplace>Heidelberg, New York, Tokyo</pplace>
</publisher>.
In addition to these extensions to the minidml format (which in the process probably could be given a more “authoritative” name like “wdml”), I would like to suggest some additional “best practice” rules for the implementation of the OAI service.
Sets
The OAI allows the organisation of collections into sets. These are in some sense sub-collections that can be retrieved individually. It seems useful to provide
- each journal as a separate set,
- sets referring to the first two digits of the MSC classification (if available).
This would allow for the dedicated collection of particular journals or subjects. In addition, the setDescription can be used to give additional information on the journal that is not necessary for every single article. Simple Dublin Core could be used as a format for this [15].
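A data provider could derive the set memberships of an article mechanically. In the sketch below, the setSpec prefixes “journal:” and “msc:”, the function names and the MSC codes are purely illustrative assumptions; none of the repositories discussed here is known to offer such sets.

```python
from urllib.parse import urlencode

def set_specs(journal_code, msc_codes):
    """Return hypothetical setSpec values: one per journal, one per 2-digit MSC class."""
    specs = ["journal:" + journal_code]
    for code in msc_codes:
        top = code.strip()[:2]
        if len(top) == 2 and top.isdigit():
            spec = "msc:" + top
            if spec not in specs:  # avoid duplicates when codes share a class
                specs.append(spec)
    return specs

def list_records_url(base_url, metadata_prefix, set_spec):
    """Build a selective-harvesting request restricted to one set."""
    query = urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "set": set_spec}
    )
    return base_url + "?" + query

# Illustrative MSC codes; the example record in this note carries none.
specs = set_specs("AIF", ["34A30", "34M35", "32S40"])
print(specs)
print(list_records_url("http://www.numdam.org/oai", "minidml", specs[1]))
```

A service provider interested in a single subject area would then harvest only the corresponding set instead of the whole repository.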
Names of files
NUMDAM uses a quite sophisticated scheme for the names and identifiers of the individual files, which allows the file to be identified if the citation is known and vice versa. It would be useful if the digitization projects used the same or a similar scheme for filenames and/or identifiers where possible.
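As an illustration of how such identifiers relate to the citation, the sketch below splits the example identifier AIF_1973__23_1_157_0 into its components. The interpretation of the positions (journal, year, volume, issue, first page) is inferred only from comparing that one identifier with its citation “Ann. Inst. Fourier 23, no.1, 157-195 (1973)”; the meaning of the empty third slot and of the trailing number is not stated in this note, so they are kept as opaque values.

```python
from typing import NamedTuple

class NumdamId(NamedTuple):
    journal: str
    year: str
    extra: str       # empty in the example; meaning not documented here
    volume: str
    issue: str
    first_page: str
    suffix: str      # "0" in the example; meaning not documented here

def parse_numdam_id(identifier):
    """Split an identifier of the assumed seven-part, underscore-separated form."""
    parts = identifier.split("_")
    if len(parts) != 7:
        raise ValueError("unexpected identifier layout: " + identifier)
    return NumdamId(*parts)

parsed = parse_numdam_id("AIF_1973__23_1_157_0")
print(parsed.journal, parsed.year, parsed.volume, parsed.issue, parsed.first_page)
```

Going the other way, from a known citation to the identifier, only requires joining the same components again.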
Character set
In general, the character set used should be indicated in the XML header. In particular, the integration of further digitization projects (e.g. RusDML) requires precise information on the encoding used. Probably UTF-8 should be used throughout. Apart from this, capitals should be used only where required, not for whole names or titles.
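Both the HMM and the NUMDAM data announce UTF-8 in their headers but deliver something else, so a harvester cannot simply trust the declaration. A minimal defensive sketch, assuming that Latin-1 is the only likely alternative (as observed for NUMDAM), could look as follows; the function name is made up for illustration.

```python
def decode_declared_utf8(raw):
    """Decode bytes declared as UTF-8; fall back to Latin-1 and flag the mismatch.

    Returns (text, was_valid_utf8). The Latin-1 fallback never fails, so the
    result is only a guess if the data are in yet another encoding.
    """
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        return raw.decode("latin-1"), False

# The same author name as delivered in the two encodings.
print(decode_declared_utf8("Gérard, R.".encode("utf-8")))
print(decode_declared_utf8("Gérard, R.".encode("latin-1")))
```

Such a fallback only repairs the symptom; the data provider should still deliver what the header promises.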
Names of persons
Although the “Last Name First” rule used with minidml is somewhat arbitrary, it will be useful for service providers to sort result sets and probably increase precision in the search. I regard this as a minimal standard, giving the name in the sorting order. An alternative might be
a special symbol signifying the sorting word, e.g.
<creator>Andrew ^Lloyd Webber</creator>
but this might be confusing if visible and overlooked if invisible (e.g. a non-breaking space).
I would rather have separate fields for the full name (as written) and the sort form, but do not see how this can be accomplished without making the format quite a bit more complicated, e.g.:
<creator>
<writtenName>Andrew Lloyd Webber</writtenName>
<sortName>Lloyd Webber, Andrew</sortName>
<alternativeName> Andrew Lloyd-Webber</alternativeName>
</creator>
for each single creator. But this form would allow additionally the reference to a name authority file (if available) and so could solve the infamous Cebychev (or Tschebytscheff, or Tchebychef …) problem. Finally, with the inclusion of Latin American or Chinese authors, anything less might be insufficient.
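How a service provider might use such a structured creator can be sketched as follows. The class mirrors the three proposed sub-fields; the class and function names and the matching rule are assumptions for illustration, and the transliteration variants are those mentioned above.

```python
from dataclasses import dataclass, field

@dataclass
class Creator:
    written_name: str                      # name as printed on the title page
    sort_name: str                         # "Last Name First" form used for sorting
    alternative_names: list = field(default_factory=list)

    def matches(self, query):
        """True if the query equals any known form of the name (case-insensitive)."""
        known = [self.written_name, self.sort_name, *self.alternative_names]
        return query.casefold() in (name.casefold() for name in known)

def sort_creators(creators):
    """Order creators by their sort form, ignoring case."""
    return sorted(creators, key=lambda creator: creator.sort_name.casefold())

creators = [
    Creator("Andrew Lloyd Webber", "Lloyd Webber, Andrew", ["Andrew Lloyd-Webber"]),
    Creator("Cebychev", "Cebychev", ["Tschebytscheff", "Tchebychef"]),
]
ordered = sort_creators(creators)
print([creator.sort_name for creator in ordered])
print(ordered[0].matches("Tchebychef"))
```

A reference to a name authority file would replace the hand-maintained list of alternative names.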
[15] See http://www.openarchives.org/OAI/2.0/guidelines-repository.htm#setDescription
In general, the time seems ripe to reach an agreement on the format for data exchange between digitization projects and providers of mathematics in digital form in general. Consensus is already sufficiently advanced and should be brought to fruition as soon as possible.