Practical Citation in a World of Evolving Data 23 March 2007

John Kunze, California Digital Library
Why cite?
Scholarship needs to reference works that
– support, refute, extend, credit, or somehow relate
to the citing thing
Citing things (those containing citations):
– Traditional: article, bibliography, CV, etc
– Modern: anything containing hyperlinks (web
page, email message, database cell)
Cited things can be hard to manage
– Things often inconvenient to handle directly
– Citations are an easy way to automate handling of
diverse objects through uniform surrogates
What's a citation? (1/3)
Def. 1 (UTas): Information identifying a publication.
“That big red volume under the bust of Homer.”
OK, but has problems
• Want a reasonably unique set of descriptors
• Want to identify things not necessarily published.
What's a citation? (2/3)
Def. 2 (CSUOhio): The minimum amount of
information needed to identify or to locate an item
quickly and efficiently.
“LC Control Number: sn 83004174”
Better, but still has problems
• Small is good but we don’t care strictly about
minimality
• Want citation to be a kind of readable narrative or
sequence of information points
What's a citation? (3/3)
Def. 3 (jak-07): A sequence of tokens that can be
compared with an object for a match.
Bard JB and Davies JA. Development, Databases
and the Internet. Bioessays. 1995 Nov;17(11):999-1001.
Comments
– Ideally a citation serves as a smallish object surrogate
to help us uniquely identify, talk about, find, use, and
manage information objects
– But citations are human creations that will sometimes
be too small, large, or ambiguous for some purposes
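Def. 3 can be read operationally. The following Python sketch is one assumed interpretation, not a method given in the talk: it splits a citation into tokens and reports what fraction of them also appear in a description of a candidate object. The names `tokenize` and `match_score` and the sample catalog record are hypothetical.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match_score(citation, object_description):
    """Fraction of citation tokens that also occur in the object's description."""
    cited = tokenize(citation)
    if not cited:
        return 0.0
    available = set(tokenize(object_description))
    hits = sum(1 for token in cited if token in available)
    return hits / len(cited)

# The citation from the slide, and a hypothetical catalog record for the same article.
citation = ("Bard JB and Davies JA. Development, Databases and the Internet. "
            "Bioessays. 1995 Nov;17(11):999-1001.")
record = ("Development, databases and the Internet / Bard JB, Davies JA / "
          "Bioessays 17(11) 999-1001, Nov 1995")
vague = "That big red volume under the bust of Homer."

print(match_score(citation, record))  # every citation token is found in the record
print(match_score(vague, record))     # only "the" is found
```

A citation that is too small or too ambiguous shows up here as a low score against the intended object, or as equally high scores against several objects.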
Modern data citation
Informal citation (going strong) revolutionized by precision,
concision, and access convenience of the URL, eg,
– A simple URL plus text in an email message
– A URL plus Title in an RSS feed
Challenge: what role will URLs play in formal citation? URLs
are so convenient that they’re being used in formal
scholarly citation despite many problems
– Often broken or incorrect soon after publication
– Created inconsistently, some by citers, some by providers
– No standard to indicate content that will/won’t/may change
– No standard to indicate hierarchy, versions, change history
Inside a modern citation
The citation token sequence can be broken down:
1. Descriptive information tokens (title, author, etc)
– Formal citation still requires the narrative that is
provided by the descriptive tokens
2. A URL, if any, leading to the object
– Modern users expect/demand URLs
– The URL is such a high priority that it is the smallest
citation; if you tossed tokens to save space, the URL
would be the last to go
– To future-proof the URL concept, consider the generic
term “actionable object identifier” instead
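The breakdown above can be sketched as a small data structure. This is a minimal Python illustration under stated assumptions: the class `Citation`, its `shorten` method, the rule of dropping trailing descriptive tokens first, and the example URL are all hypothetical. The only point taken from the slide is that the URL is the last token to be tossed.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Citation:
    descriptive: List[str]      # title, author, etc.
    url: Optional[str] = None   # the URL, if any, leading to the object

    def tokens(self):
        """Full token sequence: descriptive tokens followed by the URL, if present."""
        return self.descriptive + ([self.url] if self.url else [])

    def shorten(self, max_tokens):
        """Toss descriptive tokens (trailing ones first, an assumption) to fit the limit.

        The URL is never tossed while any descriptive token remains.
        """
        if max_tokens < 1:
            raise ValueError("a citation needs at least one token")
        room = max_tokens - (1 if self.url else 0)
        return Citation(self.descriptive[:max(room, 0)], self.url)

# Hypothetical URL; the descriptive tokens come from the slide's example citation.
full = Citation(
    ["Bard JB and Davies JA",
     "Development, Databases and the Internet",
     "Bioessays",
     "1995 Nov;17(11):999-1001"],
    "http://example.org/bioessays/17/11/999",
)
print(full.shorten(3).tokens())
print(full.shorten(1).tokens())  # only the URL is left: the smallest citation
```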
Axoid: actionable object identifier
An actionable object identifier (today’s URL) is a
string associated with an object that can be
submitted to widely available software (eg, web
browsers) to gain direct access to the object
– Abbreviation: axoid = actionable object identifier
– Is axoid a future-proof term for the function of today’s URL?
Identifiers that are not axoids
– Machine-usable identifier sequences (eg, as Google
search strings) such as “Author: Mark Twain”
– Hack-actionable ids, eg, “Bioessays 17(11):999-1001”
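The distinction between axoids and other identifiers can be approximated in code. The sketch below is a rough Python test, assuming that "submittable to widely available software" today means an http or https URL with a host. The function name `is_axoid`, the scheme list, and the example ARK-style URL are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumption: schemes that a stock web browser will act on directly.
ACTIONABLE_SCHEMES = {"http", "https"}

def is_axoid(identifier):
    """True if the string looks like something a web browser can act on directly."""
    parts = urlsplit(identifier.strip())
    return parts.scheme.lower() in ACTIONABLE_SCHEMES and bool(parts.netloc)

print(is_axoid("http://example.org/ark:/12345/x9"))  # hypothetical URL
print(is_axoid("Author: Mark Twain"))                # machine-usable, not actionable
print(is_axoid("Bioessays 17(11):999-1001"))         # hack-actionable at best
```

A test like this only checks form; it says nothing about whether the URL still resolves to the cited object.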
(modern citation) - (traditional citation)
= (actionable object identifier)
Modern citations are really traditional object
descriptions tweaked to admit URLs
Actionable object identifiers play a pivotal role
– Smallest citation (implications for identity, and
permanence of the access experience)
– A machine-readable token (implications for future
additional leverage points in provider-to-user
communication)
Immodest data citation wish list
Want: formal citations to/for everything…
– All databases
– All levels of granularity (table, row, column)
– For any snapshot (version, possibly time)
– All formatted views: XML, HTML, custom,
etc., with or without annotations
– Access to older, newer, and latest versions
– Plus actionability (“Click-through”)
– Plus persistence (validity into the future)
Data user vs data provider
Mismatched expectations:
User wants ref to               Provider can do
Evolving service                Yes or no
Specific version                Yes, no, maybe
Arbitrary snapshot              Yes, no, maybe
Archival reference              Yes, no, maybe
Self-archived dump              Yes, no, maybe
A standard way to learn what    NO
a data provider can do
Provider-supplied expectations
Assumptions
• No user can get what a provider doesn’t give, ie, don’t
expect a service unless the provider says it is offered
• Services are hard, with most providers concentrating
on active development of an evolving data service
• No provider can do everything, so offerings will vary
• Without a model for data provision and terminology
referencing the model, there will never be a standard
way for providers to support our citation wish list
Starter terminology
Beyond active development of an evolving data service,
the next easiest citation service is to provide
download access to a documented snapshot (dated
with defined version) saved in external dump format
Much harder: ongoing access to the old experience of a
snapshot while maintaining the old interfaces
Starter characterization of database services:
– Evolving - (default) evolving data and interface
– Passive snapshot - external dump having a download URL
indicating a specific version
– Active snapshot - ongoing interface having a URL indicating
a specific version
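The starter characterization could be written down as a small model that a provider fills in to state what it supports. The Python sketch below is an assumption-laden illustration: `ServiceKind`, `Offering`, `can_cite_version`, and the example URLs are hypothetical names, and the rule that every snapshot must carry a version follows the slide's wording ("indicating a specific version").

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ServiceKind(Enum):
    EVOLVING = "evolving"                  # (default) evolving data and interface
    PASSIVE_SNAPSHOT = "passive snapshot"  # external dump with a download URL
    ACTIVE_SNAPSHOT = "active snapshot"    # ongoing interface for one version

@dataclass(frozen=True)
class Offering:
    kind: ServiceKind
    url: str
    version: Optional[str] = None

    def __post_init__(self):
        # Both snapshot kinds have a URL indicating a specific version.
        if self.kind is not ServiceKind.EVOLVING and self.version is None:
            raise ValueError("a snapshot offering must name a specific version")

def can_cite_version(offerings, version):
    """True if the provider advertises any snapshot of the given version."""
    return any(o.version == version
               for o in offerings
               if o.kind is not ServiceKind.EVOLVING)

# Hypothetical provider advertisement.
offerings = [
    Offering(ServiceKind.EVOLVING, "http://example.org/db"),
    Offering(ServiceKind.PASSIVE_SNAPSHOT, "http://example.org/db/dump.v3.zip", "3"),
]
print(can_cite_version(offerings, "3"))
print(can_cite_version(offerings, "4"))
```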
Citation as communication
“Cite me as follows” - provider-to-citer
“OK” - provider-to-user-via-citer
“I made this up by taking a provider URL, altering the version
number and stripping the tail component to get to its parent; the
URL returned something so I guess it’s ok” - citer-to-user
Dicey in general (“guess it’s ok”), but is it useful to support
standard inferencing about things like
• Version numbers, eg, in URL (axoid) suffix “.vN”
– First is N=1, next is N+1, latest is no suffix or “.v0”
• Hierarchy, eg, in URL use of ‘/’ as per ARK identifiers
• Variants, eg, in URL use of ‘.’ as per ARK identifiers
Note: inferencing in citation aside from in URL (axoid) may be hard
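If such conventions were adopted, the inferencing would be mechanical. The Python sketch below assumes the “.vN” suffix convention floated above (a proposal, not an established standard) and the use of ‘/’ for hierarchy. The function names and the example ARK-style URL are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

_VERSION_SUFFIX = re.compile(r"\.v(\d+)$")

def version_of(url):
    """Return N from a trailing '.vN'; 0 stands for latest (no suffix or '.v0')."""
    match = _VERSION_SUFFIX.search(urlsplit(url).path)
    return int(match.group(1)) if match else 0

def with_version(url, n):
    """Rewrite the URL to point at version n; n == 0 yields the suffix-free latest."""
    if n < 0:
        raise ValueError("version numbers start at 1 (0 means latest)")
    parts = urlsplit(url)
    base = _VERSION_SUFFIX.sub("", parts.path)
    path = base if n == 0 else f"{base}.v{n}"
    return urlunsplit(parts._replace(path=path))

def next_version(url):
    """Infer the URL of version N+1 from a URL that names version N."""
    n = version_of(url)
    if n == 0:
        raise ValueError("the latest version has no inferable successor")
    return with_version(url, n + 1)

def parent(url):
    """Strip the tail path component to infer the parent in the '/' hierarchy."""
    parts = urlsplit(url)
    head, _, _ = parts.path.rstrip("/").rpartition("/")
    return urlunsplit(parts._replace(path=head or "/"))

url = "http://example.org/ark:/12345/x9/chapter2.v3"  # hypothetical
print(next_version(url))
print(with_version(url, 0))
print(parent(url))
```

An inferred URL is still only a guess until the provider confirms it, which is the “guess it’s ok” problem the slide describes.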
Data sets vs documents
Might documents and data have a
common citation model?
                            Data                  Document
Systematically organized    yes                   yes
Hierarchical                yes                   yes
Machine readable            Yes, with metadata    Sorta, with schema structure
                            for semantics help    (TEI, CCS docs)
Units of reference
            Data                       Document
Whole       Eg, IUPHAR                 Eg, Encyclopedia
Part        Table, row, column, cell   Chapter, section, page, subpage
Version     Yes                        Edition, revision, printing
Metadata    Codebook, encoding,        TOC, index, table of
            context                    illustrations
Views and units of delivery
                          Data                        Document
Presentation              HTML                        Paper, HTML, PDF
Machine readable          XML, CSV                    XML, SGML, TEI, ASCII
Whole                     FTP, CDROM                  All 20 volumes, CDROM
Progressive navigation    HTTP hierarchy browse       HTTP hierarchy browse
Human-friendly doses      User-interface-dependent    “page”, “citation”
Data citation might-settle-for list
All databases? Too hard, just supported databases
All levels of granularity? For any snapshot? All views?
Citable granules, versions, and views as defined and
advertised by the data provider
– Conventions would be very useful (eg, ARK); versioning
below database level considered harmful (PB)
Access to older/newer version, & latest version?
Some pointer to change history; version inferencing
With or without annotations? Deferred
What about actionability and persistence? Required, so
long as we know what experience we mean to persist