Data Citation Index - Computational Challenges in Data Citation

Prepared for
Computational Challenges in Data Citation
April 2014
Citation Principles and Standards
Dr. Micah Altman
<escience@mit.edu>
Director of Research, MIT Libraries
Non-Resident Senior Fellow, The Brookings Institution
DISCLAIMER
These opinions are my own; they are not the opinions
of MIT, Brookings, any of the project funders, nor (with
the exception of co-authored previously published
work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the
future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill,
Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi,
Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle,
George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White,
etc.
Collaborators & Co-Conspirators
• Merce Crosas, Harvard University
• Data Citation Synthesis Group
• CODATA Task Group on Data Citation
• Research Support
Thanks to the NSF, NIH, IMLS, Sloan
Foundation, the Joyce Foundation, the Judy Ford
Watson Center for Public Policy, Amazon
Corporation
Related Work
• M. Altman and M.P. McDonald. (2014) “Public Participation GIS: The Case of Redistricting.” Proceedings of the 47th Annual Hawaii International Conference on System Sciences. Computer Society Press (IEEE).
• Novak K, Altman M, Broch E, Carroll JM, Clemins PJ, Fournier D, Laevart C, Reamer A, Meyer EA, Plewes T. (2011) Communicating Science and Engineering Data in the Information Age. National Academies Press.
• Micah Altman, Simon Jackman. (2011) Nineteen Ways of Looking at Statistical Software. Journal of Statistical Software 42 (2): 1-12.
• Micah Altman, Jeff Gill, Michael McDonald. (2003) Numerical Issues in Statistical Computing for the Social Scientist. John Wiley & Sons.
• Altman, M., & Crabtree, J. (2011) Using the SafeArchive System: TRAC-Based Auditing of LOCKSS. Archiving 2011 (pp. 165-170). Society for Imaging Science and Technology.
• M. Altman, Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., & Young, C. (2009) “Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences.” The American Archivist 72(1): 169-182.
• Data Synthesis Task Group. (2014) Joint Principles for Data Citation.
• CODATA Data Citation Task Group. (2013) Out of Cite, Out of Mind: The Current State of Practice, Policy and Technology for Data Citation. Data Science Journal 12: 1-75.
• NDSA. (2013) National Agenda for Digital Stewardship. Library of Congress.
Reprints available from: informatics.mit.edu
How did we get here?
Exemplar Systems, Core Principles, and Key Work, by period

1977-1998
Exemplar systems: ICPR Archive; MARC catalog systems.
Core principles: Facilitate description & information retrieval
- Describe data in archives
- Describe as works, not media
- Provide author, title, version
Key work: [Avram 1975] [Dodd 1979] [ISBD 1990] [ISO 1997]

1999-2003
Exemplar systems: NESSTAR; Virtual Data Center
Core principles: Facilitate access & persistence
- Cite research data in all publications that use it
- Provide actionable URIs
- Provide persistent identifiers
- Use persistent institutions
Key work: [Altman, et al. 2001] [Ryssevik & Musgrave 2001]

2004-2009
Exemplar systems: TIB DOI Service; Dataverse Network
Core principles: Facilitate verification & reproducibility
- Provide bit- or semantic-fixity
- Provide granularity
Key work: [Brase 2004] [Buneman 2006] [Altman & King 2007]

2009-
Exemplar systems: Dataverse Network; DataCite; Data Dryad; FigShare; Data Citation Index
Core principles: Facilitate integration
- Include data citations in standard locations in text
- Index data citations in existing catalogs
- Integrate data citation with …
Key work: [Uhlir (ed.) 2012] [CODATA 2013] [Data Synthesis Group 2014]

Source: Altman & Crosas, 2014. The Evolution of Data Citation. IASSIST Quarterly. Forthcoming.
The Joint Principles
for Data Citation
Significance & Scope
• Sound, reproducible scholarship rests upon a
foundation of robust, accessible data.
• Data should be considered legitimate, citable products
of research.
• Data citation, like the citation of other evidence and
sources, is good research practice.
• The Joint Principles cover purpose, function and
attributes of citations.
• Specific practices vary across communities and
technologies – we recommend communities develop
practices for machine and human citations consistent
with these general principles.
The Noble Eight-Fold Path to Citing Data
• Importance. Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications[1].
• Credit and attribution. Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data[2].
• Evidence. In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited[3].
• Unique Identification. A data citation should include a persistent method for identification that is machine-actionable, globally unique, and widely used by a community[4].
• Access. Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data[5].
• Persistence. Unique identifiers, and metadata describing the data and its disposition, should persist -- even beyond the lifespan of the data they describe[6].
• Specificity and verifiability. Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific timeslice, version and/or granular portion of data retrieved subsequently is the same as was originally cited[7].
• Interoperability and flexibility. Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities[8].
An Example
Placement of Citations
Intra-work:
● Should provide sufficient information to identify the cited data within the included reference list.
● Citation to data should be in close proximity to claims relying on data. [Principle 3]
● May include additional information identifying the specific portion of data supporting that claim. [Principle 7]
Example: The plots in Figure X show the distribution of selected measures from the main
data [Author(s), Year, portion or subset used].
Full Citation:
Citation may vary in style, but should be included in the full reference list along with citations to other
types of works.
Example:
References Section
Author(s), Year, Article Title, Journal, Publisher, DOI.
Author(s), Year, Dataset Title, Data Repository or Archive, Version, Global Persistent Identifier.
Author(s), Year, Book Title, Publisher, ISBN.
Generic Data Citation
(as it appears in printed reference list)
Author(s), Year, Dataset Title, Data Repository or Archive, Version, Global Persistent Identifier
● Author(s), Data Repository or Archive: Principle 2, Credit and Attribution (e.g. authors, repositories or other distributors and contributors).
● Global Persistent Identifier: Principle 4, Unique Identifier (e.g. DOI, Handle). Principles 5 and 6, Access and Persistence: a persistent identifier that provides access and metadata.
● Version: Principle 7, Specificity and verification (e.g. the specific version used). Versioning or timeslice information should be supplied with any updated or dynamic dataset.
Note:
● Neither the format nor specific required elements are intended to be defined with this
example. Formats, optional elements, and required elements will vary across publishers and
communities. [Principle 8: Interoperability and flexibility].
● As illustrated in the previous examples, intra-work citations may be accompanied with
information including the specific portion used. [Principles 7,8].
● As illustrated in the next example, printed citations should be accompanied by metadata that
support credit, attribution, specificity, and verification. [Principles 2, 5 and 7].
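The generic citation above can be sketched in code. This is only an illustration of the element order shown on the slide, not a format defined by the Joint Principles; the class name, the field names, the "et al." rule and the example values (including the identifier) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical container for the elements of the generic data citation."""
    authors: Tuple[str, ...]
    year: int
    title: str
    repository: str
    version: str
    identifier: str  # global persistent identifier, e.g. a DOI or Handle

    def reference(self) -> str:
        """Full citation for the reference list, in the slide's element order."""
        parts = ["; ".join(self.authors), str(self.year), self.title,
                 self.repository, self.version, self.identifier]
        return ", ".join(parts) + "."

    def in_text(self, portion: Optional[str] = None) -> str:
        """Intra-work citation, optionally naming the portion or subset used."""
        names = self.authors[0]
        if len(self.authors) > 1:
            names += " et al."
        fields = [names, str(self.year)]
        if portion:
            fields.append(portion)
        return "[" + ", ".join(fields) + "]"


# Made-up example values; the identifier is not a real DOI.
citation = DataCitation(("Doe, J.", "Roe, R."), 2014, "Example Survey Data",
                        "Example Archive", "V2", "doi:10.1234/EXAMPLE")
print(citation.reference())
print(citation.in_text("Table 3"))
```

Keeping the version and the identifier as separate fields mirrors the slide: the identifier supports Principles 4 to 6, while the version supports Principle 7.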
Citation Metadata
Author(s), Year, Dataset Title, Data Repository or Archive, Version, Global Persistent Identifier.
Metadata retrieval:
EXAMPLE METADATA
<!-- CONTRIBUTOR METADATA -->
<contributor role="" ORCIDid="">Name</contributor>
<!-- FIXITY and PROVENANCE -->
<fixity type="MD5">XXXX</fixity>
<fixity type="UNF">UNF:XXXX</fixity>
<!-- MACHINE UNDERSTANDABILITY -->
<content-type>data</content-type>
<format>HDF5</format>
Note:
● Metadata location, formats, and elements will vary
across publishers and communities. [Principle 8]
● Citation metadata is needed in addition to the
information in the printed citation.
● Metadata describing the data and its disposition
should persist beyond the lifespan of the data.
[Principle 6]
● Citation metadata should support attribution and
credit [Principle 2]; machine use [Principle 5];
specificity and verification [Principle 7].
● For example, additional citation metadata may be
embedded in the citing document; attached to the
persistent identifier for the citation, through its
resolution service; stored in a separate community
indexing service (e.g. DataCite, CrossRef); or provided
in a machine-readable way through the surrogate
(“landing page”) presented by the repository to which
the identifier is resolved.
For more detail, see the References section.
http://www.force11.org/node/4772
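As a rough sketch of how fixity metadata like the example above supports verification [Principle 7], the following reads the declared MD5 value and compares it with a digest of the retrieved bytes. The wrapping root element, the attribute values and the sample data are assumptions made for illustration; real metadata formats and locations vary across publishers and communities. UNF verification is not shown.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical metadata record. The <citationMetadata> root and the attribute
# values are invented so that the fragment parses as a single XML document.
EXAMPLE_METADATA = """
<citationMetadata>
  <contributor role="author" ORCIDid="0000-0000-0000-0000">Name</contributor>
  <fixity type="MD5">{md5}</fixity>
  <format>CSV</format>
</citationMetadata>
"""


def declared_fixity(metadata_xml: str) -> Dict[str, str]:
    """Map each declared fixity type (e.g. 'MD5') to its value."""
    root = ET.fromstring(metadata_xml.strip())
    return {el.get("type", ""): (el.text or "").strip()
            for el in root.iter("fixity")}


def verify_md5(data: bytes, metadata_xml: str) -> bool:
    """True if the retrieved bytes match the MD5 digest in the metadata."""
    expected = declared_fixity(metadata_xml).get("MD5")
    if expected is None:
        return False  # nothing to verify against
    return hashlib.md5(data).hexdigest() == expected.lower()


retrieved = b"id,value\n1,0.5\n"
record = EXAMPLE_METADATA.format(md5=hashlib.md5(retrieved).hexdigest())
print(verify_md5(retrieved, record))            # data unchanged
print(verify_md5(retrieved + b"2,3\n", record))  # data modified after citation
```

A bitwise digest such as MD5 only detects that the bytes changed; it says nothing about whether a reformatted copy is semantically the same data, which is the role of the semantic (UNF) fixity element in the example.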
Implementations
Replication Data Publishing – With Citation Minting
ICPSR Replication Archive
• Traditional disciplinary data archive
• Minimal cataloging and storage for free
• Fully curated open-data model for deposit fee
• Fully curated membership model
Dataverse Network
• Open source system
• Hubs run at Harvard, other universities
• Archives data
• Generates persistent identifiers (handles, DOIs forthcoming)
• Generates resolvable citations
• Versioned
• Harvard Library Dataverse now part of DataCite, Data-PASS preservation network
FigShare
• Closed source
• No charge
• Archives data
• Supports DOIs, ORCIDs
• Preserved in CLOCKSS
General Data Sharing
Dataverse Network
• Open source system
• Hubs run at Harvard, other universities
• Archives data
• Generates persistent identifiers (handles, DOIs forthcoming)
• Generates semantic fingerprints
• Generates resolvable citations
• Versioned
• Harvard Library Dataverse now part of DataCite, Data-PASS preservation network
Scientific Data Journal
• Scientific data publishing journal
• Publishes “data papers”
• Nature Publishing Group
• Also, GigaScience
FigShare
• Closed source
• No charge
• Archives data
• Supports DOIs, ORCIDs
• Preserved in CLOCKSS
CKAN
• Open source
• DIY hosting – you host
• Based on Drupal
Data Citation Indexing
DataCite
• DOI registry service (DOI provider)
• Data DOI metadata indexing service (parallel to CrossRef)
• Not-for-profit membership organization
• Collaborating with ORCID-EU to embed ORCIDs
Data Citation Index
• Commercial service (Thomson Reuters)
• Indexes many large repositories (e.g. Data-PASS)
• Beginning to extract citations from TR publications
Linking
SciDirect Linking
• DOI, entity, and application linking
• Limited number of repositories
• Elsevier journals
• http://www.dlib.org/dlib/january11/aalbersberg/01aalbersberg.html
Community Implementation Efforts
Joint Data Citation Implementation Group
• Overlaps with citation synthesis group
• Focuses on broad range of implementation
• Early stages
• Led by Tim Clark
www.force11.org/datacitationimplementation
Research Data Alliance Data Citation Working Group
• Overlaps with synthesis group
• Focuses on dynamic data citation
• Early-mid stage
• Led by Andreas Rauber
rd-alliance.org/workinggroups/data-citation-wg.html
Basic Research - Modeling
• Models of identity → what should be cited?
– The equivalence question.
How does one determine whether two data objects, not bitwise identical, are semantically
equivalent?
– The versioning question.
How does one unambiguously assign at the time of citation a ‘version’ to a data object, such
that someone referencing the citation later can retrieve or recreate the data object in the
same state that it was at the time of citation?
– The granularity question.
How does one unambiguously describe components and/or subsets of a data object for
purposes of computations, provenance, and attribution?
• Models of provenance → how to integrate citation & curation?
– How to link citation to the chain of ownership & history of transformations applied to the data?
– How to instrument citation to support verification of provenance & authenticity?
• Models of attribution → how to communicate and support attribution through citation?
– Legal attribution: how to identify intellectual property rights and licenses.
– Ontological attribution: how to classify the semantic relationship between citing and cited
objects.
– Scholarly attribution: how to communicate and incent systems of scholarly credit and
evaluation.
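The equivalence question can be made concrete with a toy semantic fingerprint for small tables. The sketch below is not the UNF algorithm and is not taken from any of the systems discussed; it assumes, for illustration only, that column order, row order and numeric formatting carry no meaning, and that seven significant digits is an acceptable precision. The function names are hypothetical.

```python
import csv
import hashlib
import io


def canonical_cell(value, digits: int = 7) -> str:
    """Normalise one cell: numbers to fixed significant digits, text trimmed."""
    text = (value or "").strip()
    try:
        # Adding 0.0 turns -0.0 into 0.0 so both render identically.
        return format(float(text) + 0.0, f".{digits - 1}e")
    except ValueError:
        return text


def semantic_fingerprint(csv_text: str, digits: int = 7) -> str:
    """Hash a canonical form in which column and row order do not matter."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = sorted(reader.fieldnames or [])
    canonical_rows = sorted(
        tuple(canonical_cell(row.get(col), digits) for col in columns)
        for row in rows
    )
    digest = hashlib.sha256()
    digest.update("\x1f".join(columns).encode("utf-8") + b"\x1e")
    for row in canonical_rows:
        digest.update("\x1f".join(row).encode("utf-8") + b"\x1e")
    return digest.hexdigest()


a = "id,value\n1,0.50\n2,3\n"
b = "value,id\n3.0,2\n0.5,1\n"   # same content, different layout and formatting
c = "id,value\n1,0.51\n2,3\n"    # one value actually differs
print(semantic_fingerprint(a) == semantic_fingerprint(b))
print(semantic_fingerprint(a) == semantic_fingerprint(c))
```

Every normalisation choice here is a claim about which properties of the data are significant, which is why the modeling questions above have to be settled before a fingerprint can be trusted for citation.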
Research to Enable Ubiquitous, Seamless, Interoperable Data Citation
[Figure: matrix of research topics by research type. Source: CODATA 2013, Data Science Journal, Volume 12, 13 September 2013.]
Research types: Theory; Translational; Applied; Evaluation
• Identity: definitional; equivalence
• Provenance: attributes; models
• Attribution: intellectual property; scholarly credit; microattribution
• Social/Cultural: incentives and barriers to citation; incentives and barriers to curation and publication
Topics shown: significant properties; canonical forms; data modeling and decomposition; semantic fingerprints; provenance models; efficient versioning algorithms; licenses; efficient provenance tracking; metrics; attribution roles; scholarly incentives; nanopublication; citation policies and behaviors; citation infrastructure reliability; effectiveness of interventions to promote citation
Example Computational/Informatic
Challenges
• Applied Research
(algorithms, software, & taxonomies… oh my!)
– Algorithms & software to compute semantic fingerprints on …
… social network graphs?
… video streams?
… relational databases?
… GIS coverages?
– Practical taxonomies, ontologies, & data mining methods for extracting …
… ontological citation relationships
… contributor roles
• Evaluation Research
(data mining & applied magic?)
– What proportion of tables, figures, statistical coefficients, and other data-derived objects in
research articles …
… include citations?
… can be resolved to data?
… identify the subset of data supporting the specific result?
… provide sufficient provenance to verify that the version of the data referenced corresponds
to the version cited?
• Translational Research
(out of the fire and into...)
– Citation metrics that …
… are statistically reliable
… are externally valid
… yield accurate predictive inference
… are causally relevant
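A first, very rough cut at the evaluation questions could look like the sketch below: it estimates the share of reference strings that carry a DOI-like identifier. The regular expression is a common heuristic rather than a formal DOI grammar, the function names and sample references are invented, and a matching string is no guarantee that the identifier resolves to data.

```python
import re
from typing import Iterable, List

# Heuristic pattern for DOI-like strings (an assumption, not a formal grammar).
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(reference: str) -> List[str]:
    """Return DOI-like strings found in one reference, without trailing punctuation."""
    return [match.rstrip(".,;") for match in DOI_PATTERN.findall(reference)]


def share_with_identifier(references: Iterable[str]) -> float:
    """Proportion of references containing at least one DOI-like string."""
    refs = list(references)
    if not refs:
        return 0.0
    return sum(1 for ref in refs if extract_dois(ref)) / len(refs)


# Made-up reference list; the identifier is not a real DOI.
sample = [
    "Author, 2014, Dataset Title, Example Archive, V1, doi:10.1234/ABC.DEF.",
    "Author, 2013, Book Title, Publisher.",
]
print(extract_dois(sample[0]))
print(share_with_identifier(sample))
```

Answering the harder questions on the slide (which subset supports a result, whether the version retrieved matches the version cited) needs the citation metadata itself, not just pattern matching on the printed reference.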
Notes
[1] CODATA 2013, Sec 3.2.1; Uhlir (ed.) 2012, Ch. 14; Altman & King 2007
[2] CODATA 2013, Sec 3.2, 7.2.3; Uhlir (ed.) 2012, Ch. 14
[3] CODATA 2013, Sec 3.1, 7.2.3; Uhlir (ed.) 2012, Ch. 14
[4] Altman & King 2007; CODATA 2013, Sec 3.2.3, Ch. 5; Ball & Duke 2012
[5] CODATA 2013, Sec 3.2.4, 3.2.5, 3.2.8
[6] Altman & King 2007; Ball & Duke 2012; CODATA 2013, Sec 3.2.2
[7] Altman & King 2007; CODATA 2013, Sec 3.2.7, 3.2.8
[8] CODATA 2013, Sec 3.2.10
References
• Altman, M., et al. (2001). A Digital Library for the Dissemination and Replication of Quantitative Social Science Research: The Virtual Data Center. Social Science Computer Review.
• Altman, M., & King, G. (2007). A Proposed Standard for the Scholarly Citation of Quantitative Data. D-Lib Magazine 13(3/4). <http://www.dlib.org/dlib/march07/altman/03altman.html>
• Avram, H. D. (1975). MARC, Its History and Implications. Washington: Library of Congress.
• Ball, A., & Duke, M. (2012). Data Citation and Linking. DCC Briefing Papers. Edinburgh: Digital Curation Centre.
• Brase, J. (2004). Using digital library techniques: registration of scientific primary data. In Research and Advanced Technology for Digital Libraries, pp. 488-494. Springer Berlin Heidelberg.
• Buneman, P. (2006). How to cite curated databases and how to make them citable. Proceedings of the 18th International Conference on Scientific and Statistical Database Management (pp. 195-203). Los Alamitos, CA: IEEE Computer Society.
• CODATA-ICSTI Task Group on Data Citation (2013). Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data. Data Science Journal.
• Data Citation Synthesis Group (2014). Joint Declaration of Data Citation Principles. http://www.force11.org/datacitation
• Dodd, S. A. (1979). Bibliographic Reference for Numeric Social Science Data Files: Suggested Guidelines. Journal of the American Society for Information Science 30(2), 77-82.
• ISO (1997). Information and documentation -- Bibliographic references -- Part 2: Electronic documents or parts thereof. ISO 690-2:1997. International Organization for Standardization.
• King, G. (2007). An Introduction to the Dataverse Network as an Infrastructure for Data Sharing. Sociological Methods and Research 36(2): 173-199.
• Ryssevik, J., & Musgrave, S. (2001). The Social Science Dream Machine. Social Science Computer Review.
• Uhlir, P. (ed.) (2012). For Attribution -- Developing Data Attribution and Citation Practices and Standards. National Academies of Sciences.
Questions?
E-mail: escience@mit.edu
Web: informatics.mit.edu