Practical Citation in a World of Evolving Data
23 March 2007
John Kunze, California Digital Library

Why cite?
Scholarship needs to reference works that
– support, refute, extend, credit, or somehow relate to the citing thing
Citing things (those containing citations):
– Traditional: article, bibliography, CV, etc.
– Modern: anything containing hyperlinks (web page, email message, database cell)
Cited things can be hard to manage
– Things are often inconvenient to handle directly
– Citations are an easy way to automate handling of diverse objects through uniform surrogates

What's a citation? (1/3)
Def. 1 (UTas): Information identifying a publication.
“That big red volume under the bust of Homer.”
OK, but has problems
• Want a reasonably unique set of descriptors
• Want to identify things not necessarily published

What's a citation? (2/3)
Def. 2 (CSUOhio): The minimum amount of information needed to identify or to locate an item quickly and efficiently.
“LC Control Number: sn 83004174”
Better, but still has problems
• Small is good, but we don’t care strictly about minimality
• Want the citation to be a kind of readable narrative or sequence of information points

What's a citation? (3/3)
Def. 3 (jak-07): A sequence of tokens that can be compared with an object for a match.
Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov;17(11):999-1001.
Comments
– Ideally a citation serves as a smallish object surrogate to help us uniquely identify, talk about, find, use, and manage information objects
– But citations are human creations that will sometimes be too small, too large, or too ambiguous for some purposes

Modern data citation
Informal citation (going strong) has been revolutionized by the precision, concision, and access convenience of the URL, eg,
– A simple URL plus text in an email message
– A URL plus Title in an RSS feed
Challenge: what role will URLs play in formal citation?
URLs are so convenient that they’re being used in formal scholarly citation despite many problems
– Often broken or incorrect soon after publication
– Created inconsistently, some by citers, some by providers
– No standard to indicate content that will/won’t/may change
– No standard to indicate hierarchy, versions, change history

Inside a modern citation
The citation token sequence can be broken down:
1. Descriptive information tokens (title, author, etc.)
   – Formal citation still requires the narrative that is provided by the descriptive tokens
2. A URL, if any, leading to the object
   – Modern users expect/demand URLs
   – So high a priority that the URL is the smallest citation; if you tossed tokens to save space, the URL would be the last to go
   – To future-proof the URL concept, consider the generic term “actionable object identifier” instead

Axoid: actionable object identifier
An actionable object identifier (today’s URL) is a string associated with an object that can be submitted to widely available software (eg, web browsers) to gain direct access to the object
– Abbreviation: axoid = actionable object identifier
– Is axoid a future-proof term for the function of today’s URL?
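The token breakdown and the axoid definition above can be sketched roughly in Python. This is only an illustration under stated assumptions, not a proposed standard: "widely available software" is approximated by a small set of URL schemes a web browser can open, and the names ACTIONABLE_SCHEMES, is_axoid, Citation, and shortened, as well as the example.org URL, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse

# Assumption: schemes a typical web browser can act on directly.
ACTIONABLE_SCHEMES = {"http", "https", "ftp"}


def is_axoid(identifier: str) -> bool:
    """Rough test: does the string look like something a browser could open?"""
    parts = urlparse(identifier.strip())
    return parts.scheme in ACTIONABLE_SCHEMES and bool(parts.netloc)


@dataclass
class Citation:
    """A citation as descriptive tokens plus an optional axoid."""
    descriptive: List[str]
    axoid: Optional[str] = None

    def shortened(self, max_tokens: int) -> List[str]:
        """Toss descriptive tokens first; the axoid is the last token to go."""
        if max_tokens <= 0:
            return []
        if self.axoid is None:
            return self.descriptive[:max_tokens]
        return self.descriptive[:max_tokens - 1] + [self.axoid]


# Hypothetical example: the URL below is invented for illustration.
cite = Citation(
    descriptive=["Bard JB and Davies JA",
                 "Development, Databases and the Internet",
                 "Bioessays", "1995 Nov;17(11):999-1001"],
    axoid="http://example.org/bioessays/17/11/999",
)
smallest = cite.shortened(1)  # only the axoid survives
```

Under these assumptions, a search-style string or a bare journal reference is not treated as an axoid, because no widely available software can act on it directly.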
Identifiers that are not axoids
– Machine-usable identifier sequences (eg, as Google search strings) such as “Author: Mark Twain”
– Hack-actionable ids, eg, “Bioessays 17(11):999-1001”

(modern citation) - (traditional citation) = (actionable object identifier)
Modern citations are really traditional object descriptions tweaked to admit URLs
Actionable object identifiers play a pivotal role
– Smallest citation (implications for identity, and permanence of the access experience)
– A machine-readable token (implications for future additional leverage points in provider-to-user communication)

Immodest data citation wish list
Want: formal citations to/for everything…
– All databases
– All levels of granularity (table, row, column)
– For any snapshot (version, possibly time)
– All formatted views: XML, HTML, custom, etc., with or without annotations
– Access to older, newer, and latest versions
– Plus actionability (“Click-through”)
– Plus persistence (validity into the future)

Data user vs data provider
Mismatched expectations:

User wants ref to                                     Provider can do
Evolving service                                      Yes or no
Specific version                                      Yes, no, maybe
Arbitrary snapshot                                    Yes, no, maybe
Archival reference                                    Yes, no, maybe
Self-archived dump                                    Yes, no, maybe
A standard way to learn what a data provider can do   NO

Provider-supplied expectations
Assumptions
• No user can get what a provider doesn’t give, ie, don’t expect a service unless the provider says it is offered
• Services are hard, with most providers concentrating on active development of an evolving data service
• No provider can do everything, so offerings will vary
• Without a model for data provision and terminology referencing the model, there will never be a standard way for providers to support our citation wish list

Starter terminology
Beyond active development of an evolving data service, the next easiest citation service is to provide download access to a documented snapshot (dated, with a defined version) saved in an external dump format
Much harder: ongoing access to the old experience of a snapshot while maintaining the old interfaces
Starter characterization of database services:
– Evolving - (default) evolving data and interface
– Passive snapshot - external dump having a download URL indicating a specific version
– Active snapshot - ongoing interface having a URL indicating a specific version

Citation as communication
“Cite me as follows” - provider-to-citer
“OK” - provider-to-user-via-citer
“I made this up by taking a provider URL, altering the version number and stripping the tail component to get to its parent; the URL returned something so I guess it’s ok” - citer-to-user
Dicey in general (“guess it’s ok”), but is it useful to support standard inferencing about things like
• Version numbers, eg, in URL (axoid) suffix “.vN”
  – First is N=1, next is N+1, latest is no suffix or “.v0”
• Hierarchy, eg, in URL use of ‘/’ as per ARK identifiers
• Variants, eg, in URL use of ‘.’ as per ARK identifiers
Note: inferencing in a citation, other than in the URL (axoid), may be hard

Data sets vs documents
Might documents and data have a common citation model?

                          Data                                    Document
Systematically organized  yes                                     yes
Hierarchical              yes                                     yes
Machine readable          Yes, with metadata for semantics help   Sorta, with schema structure (TEI, CCS docs)

Units of reference

          Data                          Document
Whole     Eg, IUPHAR                    Eg, Encyclopedia
Part      Table, row, column, cell      Chapter, section, page, subpage
Version   Yes                           Edition, revision, printing
Metadata  Codebook, encoding, context   TOC, index, table of illustrations

Views and units of delivery

                        Data                      Document
Presentation            HTML                      Paper, HTML, PDF
Machine readable        XML, CSV                  XML, SGML, TEI, ASCII
Whole                   FTP, CDROM                All 20 volumes, CDROM
Progressive navigation  HTTP hierarchy browse     HTTP hierarchy browse
Human-friendly doses    User-interface-dependent  “page”, “citation”

Data citation might-settle-for list
All databases? Too hard, just supported databases
All levels of granularity? For any snapshot? All views?
Citable granules, versions, and views as defined and advertised by the data provider
– Conventions would be very useful (eg, ARK); versioning below database level considered harmful (PB)
Access to older/newer versions, and latest version? Some pointer to change history; version inferencing
With or without annotations? Deferred
What about actionability and persistence? Required, so long as we know what experience we mean to persist
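As one possible reading of the version and hierarchy inferencing raised under “Citation as communication”, the sketch below parses a trailing “.vN” suffix (first is N=1, next is N+1, latest is no suffix or “.v0”) and strips the tail component at the last ‘/’ to reach a parent. It is an illustration only: the function names and the example.org URLs are hypothetical, ARK variant suffixes are left out, and whether an inferred URL actually resolves still depends on what the provider advertises.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlsplit, urlunsplit

# Assumption: the version is carried as a trailing ".vN" suffix on the axoid.
_VERSION_RE = re.compile(r"\.v(\d+)$")


def split_version(axoid: str) -> Tuple[str, int]:
    """Return (base, N); N == 0 means "latest" (no suffix or ".v0")."""
    match = _VERSION_RE.search(axoid)
    if match is None:
        return axoid, 0
    return axoid[:match.start()], int(match.group(1))


def with_version(axoid: str, n: int) -> str:
    """Rewrite the axoid to point at version n (0 gives the unsuffixed latest)."""
    base, _ = split_version(axoid)
    return base if n == 0 else f"{base}.v{n}"


def next_version(axoid: str) -> str:
    """Infer the successor of a numbered version; "latest" has no successor."""
    base, n = split_version(axoid)
    if n == 0:
        raise ValueError("cannot infer the version after 'latest'")
    return with_version(base, n + 1)


def parent(axoid: str) -> Optional[str]:
    """Strip the tail path component; None if there is no parent below the host."""
    parts = urlsplit(axoid)
    head, sep, _tail = parts.path.rstrip("/").rpartition("/")
    if not sep or not head:
        return None
    return urlunsplit((parts.scheme, parts.netloc, head, "", ""))


# Hypothetical identifiers, invented for illustration.
cited = "http://example.org/ark:/12345/db/table3.v2"
newer = next_version(cited)
latest = with_version(cited, 0)
container = parent(cited)
```

This mirrors the citer's “I made this up” scenario: the strings are easy to derive, but a derived identifier is only a guess until the provider confirms that the version or parent exists.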