A Note on the Formats used for Digitization Information

Thomas Fischer (fischer@mail.sub.uni-goettingen.de)

I will consider different formats used to disseminate information on digitized or digital material in the context of the World Digital Mathematics Library (WDML), and the Emani initiative in particular. The focus is on their use as exchange formats for data delivered via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). I will consider four formats:

- The general Dublin Core Simple format
- The versions of the RFC1807 format used by Cornell University
- The minidml format provided by the NUMDAM project
- The suggestions for metadata exchange elaborated by Math Reviews and Zentralblatt.

As far as I know, these are the most advanced suggestions for the formats in question.

The OAI-PMH

The Open Archives Initiative [1] has developed its “Protocol for Metadata Harvesting” [2] (PMH) with the explicit goal of exposing datasets for harvesting and aggregation. In this way, “Service Providers” can collect the information and provide enriched services on this basis. This is how the “Open” in the name is to be interpreted: the archives themselves are not necessarily open to everybody, but the information on the items is freely available.

The PMH offers different options for the presentation of the data. While a basic Dublin Core format is required, other formats can be exposed and requested through the protocol. This option is used by the providers of information on digitized mathematical literature, and is the main topic of these considerations. The OAI-PMH is considered an ideal means of exchanging information between the different digitization projects, and is in fact already in use by several of them. The main question is: is there an optimal format that can or should be used by the mathematical digitization community, and if there is none, what should it look like?
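Because a PMH request is an ordinary HTTP GET request, asking for a particular format comes down to setting the metadataPrefix argument. The following Python sketch shows how such request URLs can be assembled. The function name build_request_url is hypothetical; the base URL and identifier are taken from the NUMDAM example discussed later in this note, and no request is actually sent.

```python
from urllib.parse import urlencode


def build_request_url(base_url, verb, **arguments):
    """Assemble an OAI-PMH request URL (hypothetical helper, no network access)."""
    params = {"verb": verb, **arguments}
    # Keep ':' and '/' unescaped so the URL looks like the ones quoted in the text.
    return base_url + "?" + urlencode(params, safe=":/")


# Ask which metadata formats a data provider offers.
formats_url = build_request_url("http://www.numdam.org/oai", "ListMetadataFormats")

# Request the same record in two different formats.
identifier = "oai:numdam.org:AIF_1973__23_1_157_0"
dc_url = build_request_url(
    "http://www.numdam.org/oai", "GetRecord",
    metadataPrefix="oai_dc", identifier=identifier,
)
minidml_url = build_request_url(
    "http://www.numdam.org/oai", "GetRecord",
    metadataPrefix="minidml", identifier=identifier,
)

print(formats_url)
print(dc_url)
print(minidml_url)
```

Only the metadataPrefix value differs between the last two URLs, which is what makes it cheap for a data provider to offer a richer format alongside the required Dublin Core.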
The OAI has built a tool to investigate the offerings of the different OAI data providers [3]. It is a combination of server-side scripts and client-side JavaScript, and a little awkward to use, but still extremely helpful for digging into the available archives. Since the PMH is built on top of the HTTP protocol, all requests can be issued through a standard browser, so I give those requests as URLs. The returned result will usually be some form of XML, but it should be recognizable. One caveat is that not all browsers handle XML well, and some are stricter than others in their application of XML validation; Internet Explorer, for example, will not display datasets that contain mangled umlauts.

The RFC 1807 format

RFC 1807 [4] is a “format for bibliographic records describing technical reports”, published in 1995 before the inception of XML, let alone the XML Schema language. An XML version was provided by the OAI in their implementation guidelines [5]. This format is used at Cornell University for its digitization projects and for Project Euclid, albeit in slightly different versions.

[1] http://www.openarchives.org/
[2] http://www.openarchives.org/OAI/openarchivesprotocol.html
[3] http://re.cs.uct.ac.za/
[4] http://www.faqs.org/rfcs/rfc1807.html

RFC 1807 provides the following fields (<M> = mandatory, <O> = optional):

    <M>  BIB-VERSION of this bibliographic records format
    <M>  ID
    <M>  ENTRY date
    <O>  ORGANIZATION
    <O>  TITLE
    <O>  TYPE
    <O>  REVISION
    <O>  WITHDRAW
    <O>  AUTHOR
    <O>  CORP-AUTHOR
    <O>  CONTACT for the author(s)
    <O>  DATE of publication
    <O>  PAGES count
    <O>  COPYRIGHT, permissions and disclaimers
    <O>  HANDLE
    <O>  OTHER_ACCESS
    <O>  RETRIEVAL
    <O>  KEYWORD
    <O>  CR-CATEGORY
    <O>  PERIOD
    <O>  SERIES
    <O>  MONITORING organization(s)
    <O>  FUNDING organization(s)
    <O>  CONTRACT number(s)
    <O>  GRANT number(s)
    <O>  LANGUAGE name
    <O>  NOTES
    <O>  ABSTRACT
    <M>  END

These have been carried over to the XML format, omitting only the “END” tag, which is unnecessary since XML provides its own wrapper.

Formats used at Cornell University Library and Project Euclid

The rfc1807 and the DC format are used by the Cornell University Library for their OAI service. The “Cornell University Library: Historical Mathematics Monographs” [6] (HMM) and “Project Euclid” [7] both give access to the data of their collections through OAI data providers. Here are some examples of datasets in this format, extracting the salient information and omitting the headers.

[5] See http://www.openarchives.org/OAI/2.0/guidelines-rfc1807.htm. The respective XML schema document is available at http://www.openarchives.org/OAI/1.1/rfc1807.xsd.
[6] http://historical.library.cornell.edu/math/
[7] http://projecteuclid.org/

From Euclid:

http://projecteuclid.org/Dienst?verb=GetRecord&metadataPrefix=oai_rfc1807&identifier=OAI:CULeuclid:euclid.annm/1105737690

    <rfc1807>
      <bib-version>CS-TR-v2.1</bib-version>
      <id>euclid.annm:1105737690</id>
      <entry>January 18, 2005</entry>
      <title>The space of embedded minimal surfaces of fixed genus in a 3-manifold II; Multivalued graphs for disks</title>
      <type>text</type>
      <author>Colding, Tobias H.</author>
      <author>Minicozzi, William P.</author>
      <date>July 2004</date>
      <pages>24</pages>
    </rfc1807>

http://projecteuclid.org/Dienst?verb=GetRecord&metadataPrefix=oai_rfc1807&identifier=OAI:CULeuclid:euclid.hha:1088453320

    <rfc1807>
      <bib-version>CS-TR-v2.1</bib-version>
      <id>euclid.hha:1088453320</id>
      <entry>December 17, 2004</entry>
      <title>COMPUTING LINKING NUMBERS OF A FILTRATION</title>
      <type>text</type>
      <author>EDELSBRUNNER, HERBERT</author>
      <author>ZOMORODIAN, AFRA</author>
      <date>January 2003</date>
      <pages>19</pages>
      <abstract><p> We develop fast algorithms for computing the linking number of a simplicial complex within a filtration.We give experimental results in applying our work toward the detection of non-trivial tangling in biomolecules, modeled as alpha complexes.</p></abstract>
    </rfc1807>

From Historical Mathematics Monographs:

http://mathbooks.library.cornell.edu:8085/Dienst?verb=GetRecord&metadataPrefix=oai_rfc1807&identifier=OAI:CULmath:cul.math/00060001

    <rfc1807>
      <bib-version>CS-TR-v2.1</bib-version>
      <id>cul.math:00060001</id>
      <entry>November 11, 2002</entry>
      <organization>B. G. Teubner</organization>
      <title>Abriss einer theorie der algebraischen funktionen einer verèanderlichen in neuer fassung</title>
      <type>text</type>
      <author>Stahl, Hermann, 1843-1908.</author>
      <date>1911</date>
      <pages>116</pages>
      <language>ger</language>
      <notes>Computer file. Ithaca, NY : Cornell University Library, 1990. 116 image files.Files for the images of individual pages are encoded in ALDUS/Microsoft TIFF Version 5.0 using facsimile-compatible CITT Group 4 compression.</notes>
    </rfc1807>

The same document using the Dublin Core Simple Format:

http://mathbooks.library.cornell.edu:8085/Dienst?verb=GetRecord&metadataPrefix=oai_dc&identifier=OAI:CULmath:cul.math/00060001

    <oai_dc:dc>
      <dc:title>Abriss einer theorie der algebraischen funktionen einer verèanderlichen in neuer fassung</dc:title>
      <dc:creator>Stahl, Hermann, 1843-1908.</dc:creator>
      <dc:subject>Algebraic functions.</dc:subject>
      <dc:subject>Functions of complex variables.</dc:subject>
      <dc:publisher>Cornell University Library</dc:publisher>
      <dc:date>1911</dc:date>
      <dc:type>text</dc:type>
      <dc:identifier>http://mathbooks.library.cornell.edu:8085/GetRecord?id=cul.math/00060001</dc:identifier>
      <dc:identifier>cul.math/00060001</dc:identifier>
      <dc:language>ger</dc:language>
      <dc:rights>Copyright 2002 Cornell University Library</dc:rights>
    </oai_dc:dc>

A short inspection shows the following problems:

- The journal from which the article is taken is not directly available for the Euclid sets (annm stands for “Annals of Mathematics”); the year, volume, issue and page numbers are missing.
- The second entry uses capitals for title and author.
- There are several misprints in the abstract, plus erroneous HTML tags.
- There are no URLs to access the article; the URL can only be deduced from other information (it would look like http://projecteuclid.org/Dienst/UI/1.0/Display/euclid.annm/1105737690). A URL is only available with the DC format from HMM.
- The given date in Project Euclid is not the date of publication.
- For the books in the HMM collection, the publisher is given only in the rfc1807 format, and misleadingly given as “Cornell University Library” in the DC format.
- The HMM data do not use the UTF-8 encoding promised in their header.
- Capitalisation in the German titles seems arbitrary.
- The author fields contain additional information on the lifespan of the author; this is not intended in DC and not admissible in rfc1807.

Overall this information is insufficient for the exchange of digitization information, although the rfc1807 format in particular could be enhanced to carry richer and more rigid information:

- The field “OTHER_ACCESS” could be used to give the URL (homepage) of the given document.
- DATE should be the publication date. The field “ENTRY” gives the date of “creating this bibliographic record”, which could be the same date as the production of the digital document: in some sense this “bibliographic record” is the final stage of the digitization process.
- The “SERIES” field is meant to contain information on the journal, volume and issue.
- Given the heritage of the rfc1807 format from technical reports, there is only a field “CR-CATEGORY” for the classification according to the “Computer Reviews”, but an “MSC-CATEGORY” would seem plausible, although extensions to the standard are not easily accomplished [8].

The minidml format used with the NUMDAM project

The project “Numérisation de documents anciens mathématiques” [9] (NUMDAM) is based at the “Cellule de Coordination Documentaire Nationale pour les Mathématiques” [10] (Cellule MathDoc) and provides access to digitisations of the most pre-eminent French mathematics journals. The Cellule MathDoc has also started a “mini-DML Project” [11] that already collects data from different digitization centres and makes them searchable; they have also developed a “minidml” metadata format. The NUMDAM project offers the standard DC and the minidml format through their OAI data provider [12]. Here is an example of an article from the Annales de l'institut Fourier (AIF) in DC and minidml format.

[8] While the XML Schema document is provided by the OAI, the underlying format referred to is “CS-TR-v2.1”, which is part of RFC 1807 and refers to CR-CATEGORY and not to MSC.
[9] The website is http://www.numdam.org/; an English version is available at http://www.numdam.org/en/.
[10] http://www-mathdoc.ujf-grenoble.fr/
[11] http://www.numdam.org/minidml/; for more information on the project see “Introducing the mini-DML project” by Thierry Bouche (http://www.numdam.org/minidml/litterature/article-minidml.pdf).
[12] http://www.numdam.org/oai

DC:

http://www.numdam.org/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:numdam.org:AIF_1973__23_1_157_0

    <oai_dc:dc>
      <dc:creator>Gérard, R.</dc:creator>
      <dc:creator>Levelt, A. M.</dc:creator>
      <dc:title>Invariants mesurant l'irrégularité en un point singulier des systèmes d'équations différentielles linéaires</dc:title>
      <dc:date>1973</dc:date>
      <dc:identifier>AIF_1973__23_1_157_0</dc:identifier>
      <dc:identifier>http://www.numdam.org/item?id=AIF_1973__23_1_157_0</dc:identifier>
      <dc:identifier>oai:numdam.org:AIF_1973__23_1_157_0</dc:identifier>
      <dc:identifier>citation: Ann. Inst. Fourier 23, no.1, 157-195 (1973)</dc:identifier>
      <dc:identifier>MR 49 #10947</dc:identifier>
      <dc:identifier>Zbl 0243.35016</dc:identifier>
    </oai_dc:dc>

minidml:

http://www.numdam.org/oai?verb=GetRecord&metadataPrefix=minidml&identifier=oai:numdam.org:AIF_1973__23_1_157_0

    <minidml>
      <author>Gérard, R.</author>
      <author>Levelt, A. M.</author>
      <title>Invariants mesurant l'irrégularité en un point singulier des systèmes d'équations différentielles linéaires</title>
      <language>fr</language>
      <identifier scheme="internal">AIF_1973__23_1_157_0</identifier>
      <identifier scheme="url">http://www.numdam.org/item?id=AIF_1973__23_1_157_0</identifier>
      <identifier scheme="oai">oai:numdam.org:AIF_1973__23_1_157_0</identifier>
      <citation>Ann. Inst. Fourier 23, no.1, 157-195 (1973)</citation>
      <jtitle>Annales de l'institut Fourier</jtitle>
      <home>http://annalif.ujf-grenoble.fr</home>
      <provider>
        <name>Project Numdam</name>
        <home>http://www.numdam.org</home>
      </provider>
      <abbrev>Ann. Inst. Fourier</abbrev>
      <volume>23</volume>
      <issue>1</issue>
      <date>1973</date>
      <pages>157-195</pages>
      <format>application/pdf</format>
      <format>application/x-djvu</format>
      <reviewid service="zbl">0243.35016</reviewid>
      <reviewid service="mr">49 #10947</reviewid>
    </minidml>

There are some minor problems with these formats. The OAI Repository Explorer complains about the DC format:

    [Error] re.C3IbZK:15:94: cvc-complex-type.2.4.a: Invalid content was found starting with element 'oai_dc:dc'. One of '{"http://www.openarchives.org/OAI/2.0/oai_dc/":dc}' is expected.

(There seems to be a slash “/” missing after http://www.openarchives.org/OAI/2.0/oai_dc.) And there are mistakes in the minidml format as well:

    [Error] minidml.xsd:34:32: src-resolve: Cannot resolve the name 'minidml:ProviderType' to a(n) 'type definition' component.

(The ProviderType is defined as providerType in the XML schema document.)

    [Error] re.HJV7MN:24:17: cvc-complex-type.2.4.a: Invalid content was found starting with element 'home'.

(The name of the element is jhome, not home.) And actually, the encoding of the data is not UTF-8 but Latin-1, in spite of the XML header information.

Nevertheless, the DC format contains full citation information in the <dc:identifier>citation:…</dc:identifier> construct. This is some sort of a Dublin Core Structured Value [13] and as such can be used if it is understood by the recipient.

[13] See http://dublincore.org/documents/dcmi-dcsv/index.shtml for the full description; in particular, the format suggested there is slightly different and might call for “citation=”, but that does not make a difference for the given considerations.

The unqualified dc:identifier presents the URL of the given document. It should also be noted that NUMDAM provides, if available, the numbers of the review articles in Math Reviews and Zentralblatt.
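To illustrate what “understood by the recipient” means for the unqualified dc:identifier values, here is a small Python sketch that sorts the identifiers of the NUMDAM DC record into kinds. The prefix rules (http://, oai:, citation:, MR, Zbl) are assumptions read off this single example, not a documented NUMDAM convention, and the function name classify_identifiers is hypothetical. The namespace declarations, which the excerpt above omits together with the headers, are added so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"

# Record from the text, with the namespace declarations restored.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>AIF_1973__23_1_157_0</dc:identifier>
  <dc:identifier>http://www.numdam.org/item?id=AIF_1973__23_1_157_0</dc:identifier>
  <dc:identifier>oai:numdam.org:AIF_1973__23_1_157_0</dc:identifier>
  <dc:identifier>citation: Ann. Inst. Fourier 23, no.1, 157-195 (1973)</dc:identifier>
  <dc:identifier>MR 49 #10947</dc:identifier>
  <dc:identifier>Zbl 0243.35016</dc:identifier>
</oai_dc:dc>"""

# (prefix, kind, strip_prefix): assumed rules, inferred from one record only.
RULES = [
    ("http://", "url", False),
    ("oai:", "oai", False),
    ("citation:", "citation", True),
    ("MR ", "mr", True),
    ("Zbl ", "zbl", True),
]


def classify_identifiers(record_xml):
    """Group the dc:identifier values of one record by their apparent kind."""
    kinds = {}
    for element in ET.fromstring(record_xml).iter(DC_IDENTIFIER):
        value = (element.text or "").strip()
        kind = "internal"  # fallback when no known prefix matches
        for prefix, name, strip_prefix in RULES:
            if value.startswith(prefix):
                kind = name
                if strip_prefix:
                    value = value[len(prefix):].strip()
                break
        kinds.setdefault(kind, []).append(value)
    return kinds


identifiers = classify_identifiers(SAMPLE)
for kind, values in identifiers.items():
    print(kind, values)
```

A harvester has to guess in this way as long as only Dublin Core Simple is available; in the minidml record the same distinctions are stated explicitly through the scheme and service attributes.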
These review numbers are extremely important, since these databases, together with some others like the “Jahrbuch über Fortschritte der Mathematik”, provide the anchors to the article in the web of mathematics.

While this is in some sense sufficient, richer information is available through the minidml format, which provides the full name and the home page of the journal and the bibliographic information in separate fields. The complex type “provider” is a container for the identification of the digitization project, giving its name and website. Additional fields not present in the example are keyword, abstract, msc, issn, publisher and rights, with essentially the obvious meaning. It should be noted that some of the fields like title, keyword or abstract can carry an additional language qualifier; the “scheme” used with the identifier and the “service” used with the reviewid already appear in the example.

The Proposals by Mathematical Reviews and Zentralblatt

In January 2004, Jane Kister from Mathematical Reviews and Bernd Wegner from Zentralblatt MATH proposed “Standards for metadata for digitized mathematics” [14]. The proposal essentially called for the use of BibTeX as the format for providing information on the digitized items. An XML format was also presented which could serve as an exchange format for the OAI-PMH. Here are the proposed fields:

AUTHOR: required
    One or more author names, last name first, connected by ‘and’, such as “Brelot, Marcel and Choquet, Gustave”, as they appear on the title page.
TITLE: required
    The title of the item as it appears on the title page.
JOURNAL: optional
    An abbreviation for the journal name; it is recommended that the MR or ZBL abbreviation be used.
FJOURNAL: required
    Full name of journal as it appears on the title page of the journal.
VOLUME: optional
    The volume number; this may include multiple volume numbers, such as “34/35”.
YEAR: required
    The year of publication; this may include multiple years, such as “1938-39”.
NUMBER: optional
    The issue number; this may include multiple issues, such as “2-3”.
PAGES: required
    The paging string; this may include ranges such as “1–23” or more complicated strings.
ISSN: optional
    The International Standard Serial Number; this may include hyphens.
URL: required
    The web location of this item.
NOTE: optional
    For notes or miscellaneous information.
MRID: optional
    The identifier used in MathSciNet.
ZBLID: optional
    The identifier used in Zentralblatt MATH.
JFMID: optional
    The identifier used in Jahrbuch über die Fortschritte der Mathematik (JFM).

[14] The proposal is available at http://www.wdml.org/standards/metadata.pdf.

This information can all be contained in the minidml format and, like minidml, applies only to journals. The rules on which fields are optional and which are required might be disputed: for example, the abbreviated name of a journal is usually more likely to be used than the full name, and the review identifiers, if available at all, should definitely be part of the dataset. But in general, there is no serious contradiction between these suggestions and NUMDAM’s minidml, only differences in the XML used to wrap the data. On the other hand, the minidml provides a more rigid and precise format.

Suggestions

I think that the minidml format should be taken as the starting point for the format used to exchange metadata between digitization projects. Since it is directed towards articles, some extensions will be necessary for books and other formats. I propose to consider the following information.

Editor

A way to differentiate the role of an editor of a collection from the role of an author seems useful if books are considered. Note that this is quite a different role from the editor of a journal: a book usually has either author(s) or editor(s), but not both. So one could use a field “creator” for both author and editor, probably with the standard extension (ed.) for the editors.
Alternatively, an additional field “editor” could be created.

ISBN

For books, the book number is needed instead of the serial number. This could be a new field “isbn”, or instead a field like “standardNumber” with schemes issn and isbn. Note that this is not really an “identifier”, since one object may have several ISBNs.

Place of publication

This is an essential part of the bibliographic information for some citation rules, so it should be present. It could be part of an enhanced “publisher” field as in

    <publisher>Heidelberg, New York, Tokyo: Springer</publisher>

(or vice versa, or with a different separator), or a complex type as in

    <publisher>
      <pname>Springer</pname>
      <pplace>Heidelberg, New York, Tokyo</pplace>
    </publisher>

In addition to these extensions to the minidml format (which in the process could probably be given a more “authoritative” name like “wdml”), I would like to suggest some additional “best practice” rules for the implementation of the OAI service.

Sets

The OAI allows the organisation of collections into sets. These are in some sense sub-collections that can be retrieved individually. It seems useful to provide

- each journal as a separate set,
- sets referring to the first two digits of the MSC classification (if available).

This would allow for the dedicated collection of particular journals or subjects. In addition, the setDescription can be used to give additional information on the journal that is not necessary for every single article. Simple Dublin Core could be used as a format for this [15].

[15] See http://www.openarchives.org/OAI/2.0/guidelines-repository.htm#setDescription

Names of files

NUMDAM uses a quite sophisticated scheme for the names and identifiers of the individual files, which allows the file to be identified if the citation is known and vice versa. It would seem useful if the same or a similar scheme were used by the digitization projects for filenames and/or identifiers where possible.

Character set

In general, the character set used should be indicated in the XML header. In particular, the integration of further digitization projects (e.g. RusDML) requires precise information on the encoding used. Probably UTF-8 should be used throughout. Apart from this, capitals should be used only where required, not for whole names or titles.

Names of persons

Although the “Last Name First” rule used with minidml is somewhat arbitrary, it will be useful for service providers to sort result sets and will probably increase precision in the search. I regard this as a minimal standard, giving the name in the sorting order. An alternative might be a special symbol signifying the sorting word, e.g.

    <creator>Andrew ^Lloyd Webber</creator>

but this might be confusing if visible and overlooked if invisible (like e.g. a non-breaking space). I would rather have separate fields for the full name (as written) and the sort form, but do not see how this can be accomplished without making the format quite a bit more complicated, e.g.

    <creator>
      <writtenName>Andrew Lloyd Webber</writtenName>
      <sortName>Lloyd Webber, Andrew</sortName>
      <alternativeName>Andrew Lloyd-Webber</alternativeName>
    </creator>

for each single creator. But this form would additionally allow the reference to a name authority file (if available) and so could solve the infamous Cebychev (or Tschebytscheff, or Tchebychef …) problem. Finally, with the inclusion of Latin American or Chinese authors, anything less might be insufficient.

In general, the time seems ripe to reach an agreement on the format for data exchange between digitization projects and providers of mathematics in digital format in general. Consensus is already sufficiently advanced and should be brought to fruition as soon as possible.
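As a rough illustration of the structured creator element suggested under “Names of persons”, the following Python sketch builds such elements and sorts them by their sort form. The element names writtenName, sortName and alternativeName are the ones proposed above; the helper names make_creator and sort_creators are hypothetical, and the sort form is supplied by the data provider because it cannot be derived reliably from the written name.

```python
import xml.etree.ElementTree as ET


def make_creator(written_name, sort_name, alternative_names=()):
    """Build one <creator> element with written, sort and alternative forms."""
    creator = ET.Element("creator")
    ET.SubElement(creator, "writtenName").text = written_name
    ET.SubElement(creator, "sortName").text = sort_name
    for name in alternative_names:
        ET.SubElement(creator, "alternativeName").text = name
    return creator


def sort_creators(creators):
    """Order creators by sortName, ignoring case (simple code-point order)."""
    return sorted(creators, key=lambda c: c.findtext("sortName").casefold())


creators = [
    make_creator("Andrew Lloyd Webber", "Lloyd Webber, Andrew",
                 ["Andrew Lloyd-Webber"]),
    make_creator("A. M. Levelt", "Levelt, A. M."),
    make_creator("R. Gérard", "Gérard, R."),
]

for creator in sort_creators(creators):
    print(ET.tostring(creator, encoding="unicode"))
```

A service provider would display writtenName and sort or search on sortName and alternativeName. Proper alphabetical ordering across languages needs locale-aware collation, which this sketch leaves out.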
