Citation Principles and Standards

Prepared for Computational Challenges in Data Citation, April 2014

Dr. Micah Altman <escience@mit.edu>
Director of Research, MIT Libraries
Non-Resident Senior Fellow, The Brookings Institution

DISCLAIMER

These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.

Secondary disclaimer: “It’s tough to make predictions, especially about the future!” -- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.

Collaborators & Co-Conspirators

• Merce Crosas, Harvard University
• Data Citation Synthesis Group
• CODATA Task Group on Data Citation

Research Support

Thanks to the NSF, NIH, IMLS, Sloan Foundation, the Joyce Foundation, the Judy Ford Watson Center for Public Policy, Amazon Corporation

Related Work

• M. Altman and M.P. McDonald. (2014) “Public Participation GIS: The Case of Redistricting.” Proceedings of the 47th Annual Hawaii International Conference on System Sciences. Computer Society Press (IEEE).
• Novak K, Altman M, Broch E, Carroll JM, Clemins PJ, Fournier D, Laevart C, Reamer A, Meyer EA, Plewes T. Communicating Science and Engineering Data in the Information Age. National Academies Press; 2011.
• Micah Altman, Simon Jackman (2011) Nineteen Ways of Looking at Statistical Software, 1-12. In Journal of Statistical Software 42 (2).
• Micah Altman, Jeff Gill, Michael McDonald (2003) Numerical Issues in Statistical Computing for the Social Scientist. John Wiley & Sons.
• Altman, M., & Crabtree, J. 2011. Using the SafeArchive System: TRAC-Based Auditing of LOCKSS. Archiving 2011 (pp. 165–170). Society for Imaging Science and Technology.
• M. Altman, Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., & Young, C. 2009. "Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences." The American Archivist. 72(1): 169-182.
• Data Synthesis Task Group. 2014. Joint Principles for Data Citation.
• CODATA Data Citation Task Group, 2013. Out of Cite, Out of Mind: The Current State of Practice, Policy and Technology for Data Citation. Data Science Journal 12: 1–75.
• NDSA, 2013. National Agenda for Digital Stewardship, Library of Congress.
Reprints available from: informatics.mit.edu

How did we get here?

1977-1998
  Exemplar systems: ICPR Archive; MARC catalog systems
  Core principle: Facilitate description & information retrieval
    – Describe data in archives
    – Describe as works not media
    – Provide author, title, version
  Key work: [Avram 1975] [Dodd 1979] [ISBD 1990] [ISO 1997]

1999-2003
  Exemplar systems: NESSTAR; Virtual Data Center
  Core principle: Facilitate access & persistence
    – Cite research data in all publications that use it
    – Provide actionable URIs
    – Provide persistent identifiers
    – Use persistent institutions
  Key work: [Altman, et al. 2001] [Ryssevik & Musgrave 2001]

2004-2009
  Exemplar systems: TIB DOI Service; Dataverse Network
  Core principle: Facilitate verification & reproducibility
    – Provide bit- or semantic- fixity
    – Provide granularity
  Key work: [Brase 2004] [Buneman 2006] [Altman & King 2007]

2009-
  Exemplar systems: Dataverse Network; DataCite; Data Dryad; FigShare; Data Citation Index
  Core principle: Facilitate integration
    – Include data citations in standard locations in text
    – Index data citations in existing catalogs
    – Integrate data citation with
  Key work: [Uhlir (ed.) 2012] [CODATA 2013] [Data Synthesis Group 2014]

Source: Altman & Crosas, 2014. The Evolution of Data Citation. IASSIST Quarterly. Forthcoming.

The Joint Principles for Data Citation

Significance & Scope

• Sound, reproducible scholarship rests upon a foundation of robust, accessible data.
• Data should be considered legitimate, citable products of research.
• Data citation, like the citation of other evidence and sources, is good research practice.
• The Joint Principles cover purpose, function and attributes of citations.
• Specific practices vary across communities and technologies – we recommend communities develop practices for machine and human citations consistent with these general principles.

The Noble Eight-Fold Path to Citing Data

• Importance. Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications[1].
• Credit and attribution. Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data[2].
• Evidence. In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited[3].
• Unique Identification. A data citation should include a persistent method for identification that is machine-actionable, globally unique, and widely used by a community[4].
• Access. Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data[5].
• Persistence. Unique identifiers, and metadata describing the data and its disposition, should persist -- even beyond the lifespan of the data they describe[6].
• Specificity and verifiability. Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific timeslice, version and/or granular portion of data retrieved subsequently is the same as was originally cited[7].
• Interoperability and flexibility. Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities[8].
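
As a rough illustration of how the principles translate into something a machine can check, the sketch below models a citation as a small record and reports which principle-related elements are absent. The class name, field names, and the placeholder identifier are all hypothetical; the principles themselves do not prescribe any schema or required element set.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCitation:
    """Hypothetical citation record; field names are illustrative only."""
    authors: List[str]                # Principle 2: credit and attribution
    year: int
    title: str
    repository: str                   # distributor, also relevant to Principle 2
    identifier: Optional[str] = None  # Principle 4: persistent, globally unique ID
    version: Optional[str] = None     # Principle 7: version or timeslice cited


def missing_elements(citation: DataCitation) -> List[str]:
    """Return the principle-related elements this citation does not supply."""
    missing = []
    if not citation.authors:
        missing.append("contributors (Principle 2)")
    if not citation.identifier:
        missing.append("persistent identifier (Principle 4)")
    if not citation.version:
        missing.append("version or timeslice (Principle 7)")
    return missing


# Placeholder values, not a real dataset or a registered identifier.
draft = DataCitation(
    authors=["A. Author"],
    year=2014,
    title="Example Dataset",
    repository="Example Archive",
)
print(missing_elements(draft))
```

A community could use a check of this kind as a lint step in a submission workflow; which elements count as required is a local policy decision, in keeping with Principle 8.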

An Example

Placement of Citations

Intra-work:
● Should provide sufficient information to identify the cited data reference within the included reference list.
● Citation to data should be in close proximity to claims relying on data. [Principle 3]
● May include additional information identifying the specific portion of data supporting that claim. [Principle 7]

Example: The plots shown in Figure X show the distribution of selected measures from the main data [Author(s), Year, portion or subset used].

Full Citation: Citation may vary in style, but should be included in the full reference list along with citations to other types of works.

Example: References Section
Author(s), Year, Article Title, Journal, Publisher, DOI.
Author(s), Year, Dataset Title, Data Repository or Archive, Version, Global Persistent Identifier.
Author(s), Year, Book Title, Publisher, ISBN.

Generic Data Citation (as it appears in printed reference list)

Author(s), Year, Dataset Title, Data Repository or Archive, Version, Global Persistent Identifier

● Principle 2: Credit and Attribution (e.g. authors, repositories or other distributors and contributors).
● Principle 4: Unique Identifier (e.g. DOI, Handle).
● Principles 5, 6: Access, Persistence: a persistent identifier that provides access and metadata.
● Principle 7: Specificity and verification (e.g. the specific version used). Versioning or timeslice information should be supplied with any updated or dynamic dataset.

Note:
● Neither the format nor specific required elements are intended to be defined with this example. Formats, optional elements, and required elements will vary across publishers and communities. [Principle 8: Interoperability and flexibility]
● As illustrated in the previous examples, intra-work citations may be accompanied by information including the specific portion used. [Principles 7, 8]
● As illustrated in the next example, printed citations should be accompanied by metadata that support credit, attribution, specificity, and verification. [Principles 2, 5 and 7]

Citation Metadata

Author(s), Year, Dataset Title, Data Repository or Archive, Version, Global Persistent Identifier.

Metadata retrieval:

EXAMPLE METADATA
    <!-- CONTRIBUTOR METADATA -->
    <contributor role="" ORCIDid="">Name</contributor>
    <!-- FIXITY and PROVENANCE -->
    <fixity type="MD5">XXXX</fixity>
    <fixity type="UNF">UNF:XXXX</fixity>
    <!-- MACHINE UNDERSTANDABILITY -->
    <contentType>data</contentType>
    <format>HDF5</format>

Note:
● Metadata location, formats, and elements will vary across publishers and communities. [Principle 8]
● Citation metadata is needed in addition to the information in the printed citation.
● Metadata describing the data and its disposition should persist beyond the lifespan of the data. [Principle 6]
● Citation metadata should support attribution and credit [Principle 2]; machine use [Principle 5]; specificity and verification [Principle 7].
● For example, additional citation metadata may be embedded in the citing document; attached to the persistent identifier for the citation, through its resolution service; stored in a separate community indexing service (e.g. DataCite, CrossRef); or provided in a machine-readable way through the surrogate (“landing page”) presented by the repository to which the identifier is resolved.

For more detail, see the References section.
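
The two fixity entries in the example metadata stand for two different notions of "the same data": bit-level (MD5) and semantic (UNF). The sketch below contrasts them on a toy CSV file. The function names are hypothetical, and `semantic_fixity` is a deliberately simplified stand-in (canonicalize, then hash); it is not the UNF algorithm and would not reproduce a real UNF value.

```python
import csv
import hashlib
import io


def bit_fixity(raw: bytes) -> str:
    """Bit-level fixity: MD5 digest of the exact bytes."""
    return hashlib.md5(raw).hexdigest()


def canonical_rows(raw: bytes, digits: int = 7):
    """Parse CSV bytes into a canonical form: numeric cells are rendered
    with `digits` significant digits, other cells are whitespace-stripped."""
    rows = []
    for row in csv.reader(io.StringIO(raw.decode("utf-8"))):
        cells = []
        for cell in row:
            text = cell.strip()
            try:
                cells.append(format(float(text), f".{digits - 1}e"))
            except ValueError:
                cells.append(text)  # not a number: keep the stripped text
        rows.append(cells)
    return rows


def semantic_fixity(raw: bytes, digits: int = 7) -> str:
    """Simplified semantic fingerprint: SHA-256 over the canonical form.
    Illustrative only; NOT the UNF algorithm."""
    digest = hashlib.sha256()
    for cells in canonical_rows(raw, digits):
        digest.update("\x1f".join(cells).encode("utf-8"))  # cell separator
        digest.update(b"\x1e")                             # row separator
    return digest.hexdigest()


# Two serializations of the same values: different bytes, same content.
copy_a = b"id,score\n1,2.50\n2,3.0\n"
copy_b = b"id,score\r\n1, 2.5\r\n2,3.000\r\n"

print(bit_fixity(copy_a) == bit_fixity(copy_b))            # False
print(semantic_fixity(copy_a) == semantic_fixity(copy_b))  # True
```

A bit-level digest is enough when the repository serves the exact file that was cited; a canonical-form fingerprint is what lets a citation survive a format migration, at the cost of having to agree on the canonical form.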
http://www.force11.org/node/4772

Implementations

Replication Data Publishing – With Citation Minting

ICPSR Replication Archive
• Traditional disciplinary data archive
• Minimal cataloging and storage for free
• Fully curated open-data model for deposit fee
• Fully curated membership model

Dataverse Network
• Open source system
• Hubs run at Harvard and other universities
• Archives data
• Generates persistent identifiers (handles, DOIs forthcoming)
• Generates resolvable citations
• Versioned
• Harvard Library Dataverse now part of DataCite, Data-PASS preservation network

FigShare
• Closed source
• No charge
• Archives data
• Supports DOIs, ORCIDs
• Preserved in CLOCKSS

General Data Sharing

Dataverse Network
• Open source system
• Hubs run at Harvard and other universities
• Archives data
• Generates persistent identifiers (handles, DOIs forthcoming)
• Generates semantic fingerprints
• Generates resolvable citations
• Versioned
• Harvard Library Dataverse now part of DataCite, Data-PASS preservation network

Scientific Data Journal
• Scientific data publishing journal
• Published “data papers”
• Nature Publishing Group
• Also, GigaScience

FigShare
• Closed source
• No charge
• Archives data
• Supports DOIs, ORCIDs
• Preserved in CLOCKSS

CKAN
• Open source
• DIY hosting – you host
• Based on Drupal

Data Citation Indexing / Linking

DataCite
• DOI registry service (DOI provider)
• Data DOI metadata indexing service (parallel to CrossRef)
• Not-for-profit membership organization
• Collaborating with ORCID-EU to embed ORCIDs

Data Citation Index
• Commercial service (Thomson Reuters)
• Indexes many large repositories (e.g. Data-PASS)
• Beginning to extract citations from TR publications

SciDirect Linking
• DOI, entity, and application linking
• Limited number of repositories
• Elsevier journals
• http://www.dlib.org/dlib/january11/aalbersberg/01aalbersberg.html

Community Implementation Efforts

Research Data Alliance Data Citation Working Group
• Overlaps with synthesis group
• Focuses on dynamic data citation
• Early-mid stage
• Led by Andreas Rauber
• rd-alliance.org/workinggroups/data-citation-wg.html

Joint Data Citation Implementation Group
• Overlaps with citation synthesis group
• Focuses on broad range of implementation
• Early stages
• Led by Tim Clark
• www.force11.org/datacitationimplementation

Basic Research - Modeling

• Models of identity … what should be cited?
– The equivalence question. How does one determine whether two data objects, not bitwise identical, are semantically equivalent?
– The versioning question. How does one unambiguously assign at the time of citation a ‘version’ to a data object, such that someone referencing the citation later can retrieve or recreate the data object in the same state that it was at the time of citation?
– The granularity question. How does one unambiguously describe components and/or subsets of a data object for purposes of computations, provenance, and attribution?
• Models of provenance … how to integrate citation & curation?
– How to link citation to chain of ownership & history of transformations applied to it?
– How to instrument citation to support verification of provenance & authenticity?
• Models of attribution … how to communicate and support attribution through citation?
– Legal attribution: how to identify intellectual property rights and licenses.
– Ontological attribution: how to classify the semantic relationship between citing and cited objects.
– Scholarly attribution: how to communicate and incent systems of scholarly credit and evaluation.

Research to Enable Ubiquitous, Seamless, Interoperable Data Citation

[Figure: research agenda grid. Research types: Theory, Applied, Translational, Evaluation, Social-Cultural. Themes: Identity (definitional, equivalence); Provenance (attributes, models); Attribution (intellectual property, scholarly credit, microattribution). Topics shown: Significant Properties; Canonical Forms; Data Modeling and Decomposition; Semantic Fingerprints; Provenance Models; Efficient Versioning Algorithms; Efficient Provenance Tracking; Licenses; Attribution Roles; Metrics; Scholarly Incentives; Nanopublication; Citation Policies and Behaviors; Citation Infrastructure Reliability; Effectiveness of Interventions to Promote Citation; Incentives and Barriers to Citation; Incentives and Barriers to Curation and Publication.]

Source: CODATA 2013 (Data Science Journal, Volume 12)

Example Computational/Informatic Challenges

• Applied Research … are causally relevant (algorithms, software, & taxonomies… oh my!)
– Algorithms & software to compute semantic fingerprints on …
  … social network graphs?
  … video streams?
  … relational databases?
  … GIS coverages?
– Practical taxonomies, ontologies, & data mining methods for extracting …
  … ontological citation relationships
  … contributor roles
• Evaluation Research (data mining & applied magic?)
– What proportion of tables, figures, statistical coefficients, and other data-derived objects in research articles …
  … include citations?
  … can be resolved to data?
  … identify the subset of data supporting the specific result?
  … include sufficient provenance to verify that the version of the data referenced corresponds to the version cited?
• Translational Research (out of the fire and into...)
– Citation metrics that are …
  … statistically reliable
  … externally valid
  … yield accurate predictive inference

Notes

[1] CODATA 2013, Sec 3.2.1; Uhlir (ed.) 2012, ch. 14; Altman & King 2007
[2] CODATA 2013, Sec 3.2; 7.2.3; Uhlir (ed.) 2012, ch. 14
[3] CODATA 2013, Sec 3.1; 7.2.3; Uhlir (ed.) 2012, ch. 14
[4] Altman & King 2007; CODATA 2013, Sec 3.2.3, Ch. 5; Ball & Duke 2012
[5] CODATA 2013, Sec 3.2.4, 3.2.5, 3.2.8
[6] Altman & King 2007; Ball & Duke 2012; CODATA 2013, Sec 3.2.2
[7] Altman & King 2007; CODATA 2013, Sec 3.2.7, 3.2.8
[8] CODATA 2013, Sec 3.2.10

References

• Altman, M., et al. (2001). A Digital Library for the Dissemination and Replication of Quantitative Social Science Research: The Virtual Data Center. Social Science Computer Review.
• Altman, M., and King, G. (2007). "A proposed standard for the scholarly citation of quantitative data." D-Lib Magazine 13.3/4. <http://www.dlib.org/dlib/march07/altman/03altman.html>
• Avram, H. D. (1975). MARC, its history and implications. Washington: Library of Congress.
• Ball, A., Duke, M. (2012). ‘Data Citation and Linking’. DCC Briefing Papers. Edinburgh: Digital Curation Centre.
• Brase, J. (2004). "Using digital library techniques–registration of scientific primary data." In Research and advanced technology for digital libraries, pp. 488-494. Springer Berlin Heidelberg.
• Buneman, P. (2006). How to cite curated databases and how to make them citable. Proceedings of the 18th International Conference on Scientific and Statistical Database Management (pp. 195-203). Los Alamitos, CA: IEEE Computer Society.
• CODATA-ICSTI Task Group on Data Citation (2013). Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data. Data Science Journal.
• Data Citation Synthesis Group (2014). Joint Declaration of Data Citation Principles, http://www.force11.org/datacitation
• Dodd, S. A. (1979). "Bibliographic Reference for Numeric Social Science Data Files: Suggested Guidelines." Journal of the American Society for Information Science 30(2): 77-82.
• ISO (1997). Information and documentation -- Bibliographic references -- Part 2: Electronic documents or parts thereof. 690-2:1997. International Organization for Standardization.
• King, G. (2007). “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing.” Sociological Methods and Research 36 (2): 173–99.
• Ryssevik, J. & Musgrave, S. (2001). The Social Science Dream Machine. Social Science Computer Review.
• Uhlir, P. (ed.) (2012). For Attribution -- Developing Data Attribution and Citation Practices and Standards. National Academies of Sciences.

Questions?

E-mail: escience@mit.edu
Web: informatics.mit.edu